Report - Catalog of International Big Data Science Programs
Report on International Data Exchange Requirements

Authors: Jill Gemmill, Geoffrey Fox, Stephen Goff, Sara Graves, Mike L. Norman, Beth Plale, Brian Tierney

Report Advisory Committee: Jim Bottum, Clemson University; Bill Clebsch, Stanford University; Cliff Jacobs, Cliff Jacobs LLC; Geoffrey Fox, Indiana University; Stephen Goff, University of Arizona; Sara Graves, University of Alabama-Huntsville; Steve Huter, University of Oregon; Miron Livny, University of Wisconsin; Marla Meehl, UCAR; Ed Moynihan, Internet2; Mike Norman, San Diego Supercomputer Center; Beth Plale, Indiana University; Brian Tierney, ESnet

NSF Support: ACI-1223688

December 2014

1. EXECUTIVE SUMMARY

The National Science Foundation (NSF) contributes to the support of U.S. international science policy and provides core cyberinfrastructure through the International Research Network Connections (IRNC) program. The annual investment in 2014 is estimated at $7M/year and provides international network services supporting U.S. research, scholarship, and collaboration. Given the rapid growth of data-driven science, new instrumentation, and multi-disciplinary collaboration, NSF funded a study to estimate the flow of data over the international networks in the year 2020. This estimate will inform the likely scale of network requirements and the level of investment needed to support international data exchange over the next five years.

Figure 1. Interactive website at http://irnc.clemson.edu

Methods used to gather information and construct a likely scenario of future needs included in-person interviews with IRNC providers, participation in science domain conferences, review of available network utilization measurements, comparison to a commodity Internet prediction study, a survey of NSF-funded principal investigators, input from science domain experts, on-line search, and advice from the study advisory committee. In addition to this written report, the study also produced the "Catalog of International Big Data Science", an interactive and updatable website available at http://irnc.clemson.edu (Figure 1).

1.1. KEY OBSERVATIONS

1) The IRNC networks are demonstrably critical for scientific research, education, and collaboration.
NSF's $7M annual investment is highly leveraged by a factor of 10-15 due to contributions of other nations to the global R&E network infrastructure. (Section 3)
The IRNC networks are distinguished from the commodity Internet by their extremely large data-driven traffic flows; the commodity Internet is 79-90% video-driven. (Section 5.3.2)
56% of NSF-funded scientists (across all disciplines funded by NSF) participate in international collaborations. (Appendix C, Question 1)
IRNC network capacity (bandwidth available) has been keeping pace with national R&E backbone speeds and scientific requirements. (Sections 5.1.1 and 5.3.2)
IRNC has demonstrated throughput (sustained gigabits per second) and quality of service for certain applications that far exceed what is possible on the commodity Internet. (Sections 3.2 and 3.3)

2) IRNC traffic in 2014 will triple by 2018.
From 2009 to 2013, IRNC traffic is estimated to have grown by a factor of 5.7. This growth is similar to or slightly higher than the growth of the global Internet over the same period of time. (Figure 11 and Section 5.3.2)
o The growth rate beyond 2018 may be even greater as "the Internet of Things" (e.g., sensor networks) develops.
o The number of devices connected to IP networks will be twice as high as the global population by 2018, accompanied by a growing number of machine-to-machine (M2M) applications. (Section 5.3.3)
o A review of past Internet traffic data as well as technology development in general indicates that the growth trend has been and will remain exponential. (Section 5.3.3)
Known scientific data drivers for the IRNC in 2020 will include:
o Single-site instruments serving: (Sections 5.1.3 and 5.3.1)
a) the globe's astronomers and astrophysicists, such as the Large Synoptic Survey Telescope (LSST) [1], Atacama Large Millimeter/submillimeter Array (ALMA) [2], Square Kilometer Array (SKA) [3], and James Webb Space Telescope (JWST) [4];
b) the globe's physicists, such as the Large Hadron Collider (LHC) [5], International Thermonuclear Experimental Reactor (ITER) [6], and Belle detector collaborations [7].
o Thousands of highly distributed genomics and bioinformatics sites, including: (Section 5.4.2)
a) high-throughput sequencing;
b) medical images;
c) integrated Light Detection and Ranging (LIDAR) [8].
Each such site will produce as much data as a single-site telescope or detector. As these communities gain expertise in creating, analyzing, and sharing such data, the number of extremely large data sets transiting networks will increase by three orders of magnitude.
o Climate data, aggregated data sources (including sensor networks), and bottom-up collaborations will drive increased data for the global geoscience community. (Sections 5.3.1 and 5.4.3)
o CISE researchers will be working with petabyte-sized research data sets to explore improved search/recommendation algorithms, deep learning and social media, machine-to-machine (M2M) applications, and military and health applications. (Section 5.4.4)

3) The IRNC program benefits from a collaborative set of network providers, but could use better organization to maximize these benefits. (Section 5.1.5)
A strength of this approach is the ability to try multiple approaches to a problem at the same time and develop solutions that cross boundaries.
Limitations of this approach are added complexity in the absence of a central Network Operations Center (NOC), inconsistent reporting on activities and network measurements, and the absence of global reports.

4) There is limited network monitoring and measurement available among IRNC providers, which makes it very difficult to assess link utilization beyond total bandwidth used. (Section 5.2.2)
There is high interest at NSF and among science domain researchers, among others, in use of the network by discipline, by country of origin or destination, or by type of application. However, such data is in general not readily available.
A perceived host of legal, political, and cultural issues makes it difficult to address the lack of monitoring. To date that discussion has been held mostly among network providers.

5) Most end-to-end performance issues, for IRNC and high-performance R&E networks (ESnet, Internet2 (I2), Regional Optical Networks (RONs), etc.), are due to problems in the campus network, building, or end-station equipment. (Sections 5.1.5 and 5.3.1)

6) Many EPSCoR jurisdictions have fallen behind in their participation in international scientific data exchange.
In 2009, EPSCoR jurisdictions had traffic on international R&E links that was comparable to many other regions within the U.S. Current utilization by EPSCoR jurisdictions is noticeably lower, reflecting uneven continued investment in regional and campus infrastructures. (Section 5.2.1 and Figure 14)
7) The impact on IRNC activities of a trend towards cloud services and data centers as large content providers is unknown.
In the global Internet, traffic patterns have shifted from a hierarchical pattern, where big providers connect national and regional networks, to a pattern of direct connections between large content providers, data centers, and consumer networks. The impact of this transition on R&E networks is unknown. (Section 5.3.2)

1.2. RECOMMENDATIONS

The purpose of this report is to assist NSF in predicting the amount of scientific data that will be moving across international R&E networks by 2020, and also to discover special characteristics of that data such as time sensitivity or location. In addition, the study was to develop a method for conducting this analysis that could be repeated in the future. The key findings, listed in Section 1.1, show that there will be a continued exponential increase in international scientific data over the next five years. The recommendations below are "low hanging fruit" that, if followed, will best capture the opportunities and mitigate the current and future challenges of operating international R&E networks supporting data-driven science.

1) Establish a consulting service or clearing house for information on the IRNC.
The key service would be to facilitate discussions between scientists and network engineers regarding the characteristics and requirements of their data. The Department of Energy (DoE) and NSF Polar programs do this for their larger science programs. This approach could build a bridge to increase scientific productivity.
This service could be a function of the new IRNC Network Operations Center (NOC) called for in NSF solicitation 14-554. Alternatively, this service could possibly be supported by making experts available on retainer to those who need assistance.
For large NSF programs, once past the pre-proposal selection, NSF could assign this assistance to help at least large- and medium-scale science projects understand and plan for their international network capacity and its impact on their requirements.
Domain-specific workshops that include scientists, campus network staff, and backbone provider staff could be held to dig into application requirements details and learn from success stories; some of these are expected to result from the NSF CC*NIE and CI*Engineer program awards.

2) Establish a single Network Operations Center for U.S. international network providers so that users and regional operators have a single place to contact.
This service is likely to be a function of the new IRNC Network Operations Center (NOC) called for in NSF solicitation 14-554.
This service would be a central point of contact for campus, regional, and national R&E network operators and staff to contact international R&E networks, both in the U.S. and elsewhere, regarding troubleshooting, special requirements, and other matters relevant to optimal end-to-end connections.
This service would report on the status of all international R&E links, such as up/down, current load, and service announcements.
This service could provide uniform and comprehensive reporting on network traffic.

3) Establish global and uniform network measurement and reporting among IRNC providers, including more detailed utilization information and historical reporting.
Move the dialog on this topic out of a network operators-only context.
Establish/adopt a standard meta-data description for network traffic (e.g., the schema developed for the GLORIAD Insight [9] monitoring system) to enable IRNC-wide reporting and achieve common reporting to the extent that policy allows.
Implement the measurement recommendations made at two or more IRNC network monitoring meetings, i.e., begin with passive Domain Name Service (DNS) [10;11] record reporting. Accessible packet loss reports are also of high interest.

4) Continue to support collaborative coordination among network providers, within the U.S. and with external network partners.
Foster organizations that build the community working together across international boundaries. Successful examples include the Global Lambda Integrated Facility (GLIF) [12], which works to develop an international optical network infrastructure for science by identifying equipment, connection requirements, and necessary engineering functions and services. Another example is the R&E Open Exchange Points that support bi-lateral peering at all layers of the network.

5) Increase outreach and training to campus network staff in topics such as Border Gateway Protocol (BGP) [13;14], Software Defined Networks (SDN) [15], wide-area networking, how to debug "last 100 feet" issues, and how to talk with faculty about their application requirements.

6) Address the uneven development of cyberinfrastructure; it is a barrier to collaboration.
ACI and scientists in EPSCoR jurisdictions should work with the EPSCoR program to address the growing network inequality gap.
Continue the Network Startup Resource Center that focuses on training for network operators in countries whose IP traffic will grow most rapidly from now to 2020 – the Middle East and Africa.

7) Focus on the following in engineering IRNC networks:
Continue to facilitate the transfer of extremely large data sets/streams; an international drop-box may be useful.
Continue to push the envelope in supporting bi-directional audio/video at the highest resolutions.
Prepare for the "Internet of Things": extreme quantities of relatively small data transmissions (e.g., social media, sensors) that may have delivery delay requirements.
Address "busy hour" traffic patterns, where average usage increases by 10-15%.

2. TABLE OF CONTENTS

1. EXECUTIVE SUMMARY
   1.1. Key Observations
   1.2. Recommendations
2. TABLE OF CONTENTS
3. INTRODUCTION TO THE IRNC NETWORK PROGRAM
   3.1. The IRNC Production Networks
      3.1.1. ACE (America Connects to Europe)
      3.1.2. AmLight (Americas Lightpaths)
      3.1.3. GLORIAD (Global Ring Network for Advanced Applications Development)
      3.1.4. TransLight/Pacific Wave (TL/PW)
      3.1.5. TransPAC3
   3.2. The IRNC Experimental Network
   3.3. Current IRNC Capacity and Infrastructure
   3.4. IRNC Exchange Points
   3.5. Emerging Networks
   3.6. Emerging Network Technologies
   3.7. Additional Networks for International Science
4. METHODS
   4.1. Survey of Network Providers
   4.2. Available Network Measurements
   4.3. IRNC Utilization Compared to ESnet and Global Internet Traffic
   4.4. Data Trends by Science Domain
   4.5. Catalog of International Big Data Science Programs
5. FINDINGS
   5.1. Findings: Survey of Network Providers
      5.1.1. Current IRNC Infrastructure and Capacity
      5.1.2. Current Top Application Drivers
      5.1.3. Expected 2020 Application Drivers
      5.1.4. Interaction of Network Operators with Researchers
      5.1.5. Current Challenges
      5.1.6. What are Future Needs of International Networks?
      5.1.7. Exchange Point Program
   5.2. Findings: Network Measurement
      5.2.1. GLORIAD's Insight Monitoring System
      5.2.2. The Angst over Measurement and Data Sharing
   5.3. IRNC Network Traffic Compared to ESnet and the Global Internet
      5.3.1. Synopsis of Data Trends for the ESnet International Networks
      5.3.2. Global Internet Growth Rate
      5.3.3. Industry Internet Growth Forecast for 2013-2018
   5.4. Findings: Data Trends
      5.4.1. Synopsis of Data Trends in Astronomy and Astrophysics
      5.4.2. Data Trends in Bioinformatics and Genomics
      5.4.3. Data Trends in Earth, Ocean, and Space Sciences
      5.4.4. Data Trends in Computer Science
   5.5. Findings: Online Catalog of International Big Data Science Programs
      5.5.1. Large Hadron Collider: 15 PB/year (CERN)
      5.5.2. The Daniel K. Inouye Solar Telescope (DKIST): 15 PB/year
      5.5.3. Dark Energy Survey (DES): 1 GB image, 400 images per night, instrument steering
      5.5.4. Large Square Kilometer Array (Australia & South Africa): 100 PB per day
   5.6. Findings: Survey of NSF-funded PIs
6. REFERENCES
7. APPENDICES
   7.1. Appendix A: Interview with Network and Exchange Point Operators
   7.2. Appendix B: On-Line Survey for NSF-funded PIs
   7.3. Appendix C: Summary of Responses to NSF PI Survey
   7.4. Appendix D: List of Persons Interviewed
   7.5. Appendix E: Reports Used for this Study
   7.6. Appendix F: Scientific and Network Community Meetings Attended for Report Input

3. INTRODUCTION TO THE IRNC NETWORK PROGRAM

The International Research Network Connections (IRNC) network providers operate Research and Education (R&E) networks whose policies and network operational procedures are driven by the needs of international research and education programs. IRNC leverages existing commercial telecommunications providers' investments in undersea communication cables, together with university and Regional Optical Network (RON) expertise in operating regional and national R&E networks.
The U.S., through the NSF, invests approximately $7M/year in the IRNC program; this modest investment is highly leveraged by a factor of 10 to 15 via international partner investments supporting international R&E network links. IRNC networks are open to and used by the entire U.S. research and education community and operate invisibly to the vast majority of users. The IRNC networks support unique scientific and education application requirements that are not met by services across commercial backbones. In addition, the IRNC network providers are closely connected to researchers' needs and requirements and treat meeting those needs as a primary motivator. Examples of such requirements include hybrid network services, low-latency and real-time services, and end-to-end performance management. In this regard, the IRNC extends the fabric of campus, regional, and national R&E networks across oceans and continents, a web of connections built in collaboration with international partner R&E networks that serve the growing number of international scientific collaborations.

The IRNC program was most recently funded for the years 2009-2014; NSF is currently reviewing responses to solicitation 14-554. The new awards will continue to provide production network connections and services to link U.S. research networks with peer networks in other parts of the world and leverage existing international network connectivity; support U.S. infrastructure and innovation of open network exchange points; provide a centralized facility for R&E Network Operations Center (NOC) operations and innovation that will drive state-of-the-art capabilities; stimulate the development, application, and use of advanced network measurement capabilities and services across international network paths; and support global R&E network engineering community engagement and coordination.

3.1. THE IRNC PRODUCTION NETWORKS

IRNC network providers acquire, manage, and operate network transport facilities across international boundaries for shared scientific use. Network providers make arrangements with owners of optical fiber, including undersea fiber cables, to use some portion of this installed infrastructure, using equipment and management practices dedicated to R&E traffic. All shared R&E international networks are funded by the NSF in cooperation with the governments of other countries. The independently managed networks exchange traffic at Exchange Points; there, operators can implement bi-lateral policies to receive traffic from other networks and send traffic to other networks; this includes passing traffic through network B so that network A can reach network C.

Figure 2. Map of the IRNC networks, 2014

Policies can derive from human policy, current traffic conditions, current costs, and so forth. Exchange points are the focus for policy and technical coherence. A map of the IRNC networks 2009-2014 is shown in Figure 2 (map from the Center for Applied Internet Data Analysis (CAIDA) [16]). A current limitation of this overview map, and some of the following regional maps, is that they rely on manual updates to static files, so the maps are not necessarily current. Networks shown include five production networks (ACE, AmLight, GLORIAD, TransPAC3, and Pacific Wave) and one experimental network (TransLight).
3.1.1. ACE (America Connects to Europe)

Figure 3. America Connects to Europe network map

ACE (NSF Award #0962973) [17] is led by Jennifer Schopf of Indiana University, in partnership with Delivery of Advanced Network Technology to Europe (DANTE) [18], the Trans-European Research and Education Networking Association (TERENA) [19], the New York State Education and Research Network (NYSERNet) [20], and Internet2 (I2) [21]. This project connects a community of more than 34 national R&E networks in Europe.

3.1.2. AmLight (Americas Lightpaths)

Figure 4. Americas Lightpath network map

AmLight (NSF Award #0963053) [22] is led by Julio Ibarra of Florida International University. This program ties together the major research networks of Canada, Brazil, Chile, Mexico, and the United States. In addition, this work enables interconnects between the United States and the Latin American Cooperation of Advanced Networks (RedCLARA) [23], which connects eighteen Latin American national R&E networks. The Atlantic Wave and Pacific Wave Exchange Points provide peering for the North American backbone networks I2, the U.S. Department of Energy's Energy Sciences Network (ESnet) [24], and Canada's Advanced Research and Innovation Network (CANARIE) [25].

3.1.3. GLORIAD (Global Ring Network for Advanced Applications Development)

Figure 5. GLORIAD network map

GLORIAD (NSF Award #0441102) [26] is led by Greg Cole at the University of Tennessee, Knoxville. It includes cooperative partnerships with partners in Russia (Kurchatov Institute) [27], the Korea Institute of Science and Technology Information (KISTI) [28], China (Chinese Academy of Sciences) [29], the Netherlands (SURFnet) [30], the Nordic countries (NORDUnet [31] and IceLink [32]), Canada (CANARIE), and the Hong Kong Open Exchange Portal (HKOEP) [33]. In addition, new partnerships are being developed with Egypt (ENSTINet [34] and Telecomm Egypt [35]), India (Tata Communications [36] and the National Knowledge Network [37]), Singapore (SingAREN [38]), and Vietnam (VinAREN [39]).

3.1.4. TransLight/Pacific Wave (TL/PW)

Figure 6. TransLight/Pacific Wave network map

TL/PW (NSF Award #0962931) [40] is led by David Lassner, University of Hawaii. TL/PW presents a unified connectivity face toward the West for all U.S. R&E networks including I2 and Federal agency networks, enabling general and specific peerings with more than 15 international R&E links. This project not only provides a connection for Australia's R&E networking community but also provides connectivity for the world's premier setting for astronomical observatories, the summit of Mauna Kea on the Big Island of Hawaii. The Mauna Kea observatories comprise over $1 billion of international investment by 13 countries in some of the most important cyberinfrastructure resources in the world.

3.1.5. TransPAC3

Figure 7. TransPAC3 network map

TransPAC3 (NSF Award #0962968) [41] is led by Jen Schopf at Indiana University. The R&E networks included in the TransPAC3 collaboration cover all of Asia excluding only North Korea, Brunei, Myanmar, and Mongolia. TransPAC3 collaborates with the Asia Pacific Advanced Network (APAN) [42], DANTE, Internet2, and other R&E networks.

3.2. THE IRNC EXPERIMENTAL NETWORK

TransLight/StarLight (NSF Award #0962997) [43] is led by Tom DeFanti at the University of California, San Diego. The award provides two connections between the U.S. and Europe for production science: a routed connection that connects the pan-European GEANT2 to the U.S.
I2 and ESnet networks, and a switched connection that is part of the LambdaGrid fabric being created by participants of the GLIF. StarLight is a multi-100Gb/s exchange facility, peering with 130 separate R&E networks. This network is unique among IRNC networks in that it is entirely dedicated to research traffic and carries no educational commodity-type traffic (email, web pages, etc.). TransLight uses optical networking, a means of communication that encodes signals onto light, that can operate over distances from local to transoceanic and is capable of extremely high bandwidth. Optical networking can be used to partition optical fiber segments so that traffic is entirely segregated, with different policies or techniques applied to each segment. In collaboration with GLIF partners, TransLight has been able to provide high-bandwidth, low-latency performance for interactive high-definition visualization and other types of demanding applications. A limitation is that scheduling is required.

3.3. CURRENT IRNC CAPACITY AND INFRASTRUCTURE

To provide context for the description of IRNC capacity, some background information on network engineering is helpful. Traditional Internet networking is based on the TCP/IP protocol suite [44]. TCP/IP is designed to be a best-effort delivery service; there may be significant variation in the amount of time it takes a data packet to be delivered and in the amount of delay between packets. Greater congestion in the network results in greater performance variation, including the possibility of delivery failure, especially in the case of very large files. As a general practice, Internet providers achieve the desired network performance by arranging for an abundance of bandwidth; a network that operates at 50% capacity is considered well engineered since it has "headroom" to accommodate sudden bursts of traffic. TCP/IP has built-in congestion algorithms that provide equitable use of the bandwidth based on current traffic conditions; this means end-users can use the Internet at their convenience without scheduling or "busy signals". A consequence of this approach is that measures of throughput, latency, and jitter for identical data traveling the same geographic path can vary significantly due to other traffic on the network when the measurement takes place.

The IRNC production networks have been engineered to provide an abundance of bandwidth. The IRNC experimental network, in contrast, is engineered to deliberately use 100% of its bandwidth, continuously; this design is possible because this network allows only pre-authorized traffic and makes direct use of the underlying optical network. Via optical networking, the StarWave experimental network can provide single 100Gbps sustained transfers over long periods of time, as well as multiple sustained 1Gbps/10Gbps flows. End-to-end network configuration and scheduling is now accomplished in an automated manner.

In 2014, all networks except TransLight/Pacific Wave have at least one 100Gbps network path; the exception is due to the high cost of fiber crossing the Pacific Ocean (a factor of 5 higher than the Atlantic). TransLight/Pacific Wave provides 40Gbps total bandwidth to Australia and elsewhere in Asia. In 2014, a new 40Gbps direct route to New Zealand was established. The Pacific routes are expected to be upgraded to 100Gbps in 2016 (see footnote 1). All networks have redundant paths across oceans, except for the trans-Pacific connection, due to cost.
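For a rough sense of what these capacity figures mean for moving science data, the short sketch below estimates transfer times at different link rates and the TCP window (bandwidth-delay product) needed to keep a single flow running at full rate over a long transoceanic path. It is an illustrative back-of-envelope calculation, not a measurement of any particular IRNC link: the 50% effective-throughput factor and the 150 ms round-trip time are assumptions chosen only for this example.

```python
# Back-of-envelope estimates for moving large science data sets over long-haul links.
# The efficiency factor and the 150 ms round-trip time are illustrative assumptions.

def transfer_time_hours(data_tb, link_gbps, efficiency=0.5):
    """Hours to move data_tb terabytes at link_gbps, assuming the flow achieves
    `efficiency` of the nominal rate (headroom, protocol overhead, competing traffic)."""
    bits = data_tb * 8e12
    return bits / (link_gbps * 1e9 * efficiency) / 3600

def tcp_window_mb(link_gbps, rtt_ms):
    """Bandwidth-delay product: the TCP window (in MB) needed to keep one flow
    at full rate over a path with the given round-trip time."""
    return link_gbps * 1e9 * (rtt_ms / 1e3) / 8 / 1e6

if __name__ == "__main__":
    for gbps in (10, 40, 100):
        print(f"1 PB at {gbps:3d} Gbps: ~{transfer_time_hours(1000, gbps):4.0f} h; "
              f"window for 150 ms RTT: ~{tcp_window_mb(gbps, 150):.0f} MB")
```

At 100 Gbps and 50% efficiency, a petabyte still takes roughly two days to move, and a single flow needs on the order of a gigabyte of buffering to fill a 150 ms path; this is why end-host tuning matters as much as raw link capacity.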
These performance numbers are comparable to regional and national backbone speeds and to the I2 Innovation Platform. Campuses/facilities that have joined the I2 Science DMZ ("Demilitarized Zone"; see footnote 2) [45] have access to an SDN-enabled, firewall-free, 100Gbps network path that allows them to experience the highest level of end-to-end performance. Thus, network performance across oceans and/or continents should ideally be affected mostly by the distance involved and not by network bottlenecks in the path.

Footnote 1: Conversation with David Lassner, September 2014.
Footnote 2: Developed by ESnet engineers, the Science DMZ model addresses common network performance problems encountered at research institutions by creating an environment that is tailored to the needs of high-performance science applications, including high-volume bulk data transfer, remote experiment control, and data visualization.

3.4. IRNC EXCHANGE POINTS

Network exchange points for research and education flows have served a pivotal role over the last 20 years in extending network connectivity internationally, providing regional R&E networking leadership, and supporting experimental networking. Through years of operational experience combined with international peering relationships, engineering activities, and international networking forums, a set of guiding principles has emerged for successful approaches to an open exchange point. Exchange points support the homing of multiple international links and provide high-capacity connectivity to I2 and ESnet. They also provide maximum flexibility in connectivity and peering, for example services at multiple layers of the network.

3.5. EMERGING NETWORKS

The Network Startup Resource Center (NSRC) [46] develops and enhances network infrastructure for collaborative research, education, and international partnerships, while promoting teaching and training via the transfer of technology. This IRNC project focuses NSRC activities on cultivating cyberinfrastructure via technical exchange, engineering assistance, training, conveyance of networking equipment and technical reference materials, and related activities to promote network technology adoption and enhanced connectivity at R&E sites around the world. The end goal is to enhance and enable international collaboration via the Internet between U.S. scientists and collaborators in developing countries. Active progress has occurred in National Research and Education Networks (NRENs) and Research Education Networks (RENs) in Southeast Asia, Africa, and the Caribbean; this work will continue through NSF award #1451045, in the amount of $3.7M, to Steve Huter, Dale Smith, and Bill Allen for "IRNC: ENgage: Building Network Expertise and Capacity for International Science Collaboration", starting October 1, 2014.

3.6. EMERGING NETWORK TECHNOLOGIES

The StarLight experimental IRNC has had extensive experience with SDN and has supported multiple international demonstrations and research projects in this area. IRNC-funded networks have also participated in the GLIF community, which has developed the Network Service Interface (NSI) standard [47] through the Open Grid Forum (OGF) [48]. NSI describes a standardized interface for use at optical network exchange points, providing a foundation for automated scheduling and deployment of optical circuits across provider and technology boundaries.
The production IRNCs had not yet deployed SDN, with the exception of some work begun in AmLight in summer 2014, leveraging accomplishments of the Global Environment for Network Innovations (GENI) program [49], I2's Advanced Layer 2 Service (AL2S) [50] configuration tool, and some GENI-funded work in Brazil.

3.7. ADDITIONAL NETWORKS FOR INTERNATIONAL SCIENCE

In the past, research conducted at the North and South Poles, on board ships, in space, or using distributed sensors relied on workflows where data was stored on site at or in the instrument and then manually transported on some schedule to an analysis center. Due to the rapidly growing satellite and other non-terrestrial telecommunications infrastructure, workflows are shifting from periodic and manual to near real-time, using the Internet. Whether manual or by Internet, the data at some point becomes connected to the national and international research networks where data sharing and collaboration occur. In addition to the NSF-funded IRNC networks, international science relies on shipboard, satellite, and space networks to capture and forward data. Examples of these networks include:

- The Global Telecommunications System (GTS) [51], a global network for the transmission of meteorological data from weather stations, satellites, and numerical weather prediction centers.
- HiSeasNet [52], a satellite communications network designed specifically to provide continuous Internet connectivity for oceanographic research ships and platforms. HiSeasNet plans to provide real-time transmission of data to shore-side collaborators; basic communications, including videoconferencing; and tools for real-time classroom and other outreach activities.
- The NASA Space Network [53], which consists of the on-orbit Tracking and Data Relay Satellite (TDRS) telecommunications satellites, placed in geosynchronous orbit, and the associated TDRS ground stations, located in White Sands, New Mexico and Guam. The TDRS constellation is capable of providing nearly continuous high-bandwidth (S, Ku, and Ka band) telecommunications services for space research, including the Hubble Space Telescope [54], the Earth Observing Fleet [55], and the International Space Station [56].

Certain applications, such as the LHC and the NSF Division of Polar Programs, fund their own dedicated network circuits. The LHC leverages the ESnet network. Polar traffic is limited by geographic location. Several research programs use both ESnet and IRNC networks.

4. METHODS

The primary purpose of this study was to project the amount of data being exchanged via IRNC networks in the year 2020. The initial study plan was to survey the IRNC network providers, review their annual reports, examine measured network traffic over the IRNC links, and conduct interviews with representative international science programs. This approach proved challenging because of wide variation among IRNC providers in their degree of participation in and knowledge of projects using their networks. Another challenge was that most of the IRNC networks were measuring/recording only total bandwidth utilization, an approach providing limited information to analyze. The GLORIAD network was unique in providing more detailed network history information.
A survey with detailed questions regarding file size, time-to-transfer requirements, the type of file systems the files are stored on, and so forth was prepared, but after conducting several in-person interviews it became apparent that few international science projects have detailed information about their current and future plans to produce, transport, store, analyze, and share data at a level of detail that is useful for network capacity and service planning. ESnet and the NSF Polar Programs have addressed this challenge by organizing special meetings at which program scientists sat down with network engineers and spent a couple of days working through these details.

This report's advisory committee recommended that, as an alternative, domain experts be asked to provide a description of data trends in their fields, taking into consideration the following factors that would be likely sources of increased IRNC traffic:

- New instruments with higher data/transport requirements
- Scaling up of current activities (e.g., more people, more data, or more use of data)
- New areas of the world increasing their traffic through local/regional improvements (Africa, Pacific Islands)
- New technology that reduces, by orders of magnitude, the cost of collecting/retaining data
- Programs currently funding their own communications network (e.g., the NSF Polar programs) that may move to the R&E networks

In addition, a survey of NSF PIs was carried out to further explore these same questions. All surveys used are included in the appendix section, along with a list of persons interviewed and reports referenced.

4.1. SURVEY OF NETWORK PROVIDERS

The PI or a PI-designated representative for each of the IRNC production and research networks and exchange point operators was interviewed over the period September 2013 – January 2014. The survey used can be found in Appendix A. Questions were designed to collect data on current capabilities, data volume, user community needs, user support approach, upgrade strategies, and data projections. The questionnaire was also used with one regional network provider who connects to the IRNC. A summary of survey responses is available in Section 5.1.

4.2. AVAILABLE NETWORK MEASUREMENTS

The IRNC network providers were asked to provide measures of network performance for their networks. Most IRNC networks could provide some measure of bandwidth utilization over time. GLORIAD was the one network provider that had been maintaining records of IP flows (one flow typically corresponds to one application) over its ten years of operation. The absence of more detailed information about network traffic on IRNC networks was explained by providers as resulting from (a) strict concerns among the European research community regarding privacy, (b) challenges in developing multiple bi-lateral policies allowing such measurement, and (c) lower priority/lack of funding. A summary of available measurements is described in Section 5.2.

4.3. IRNC UTILIZATION COMPARED TO ESnet AND GLOBAL INTERNET TRAFFIC

Data representing IRNC traffic that was available for the period 2009-2013 was compared to an analysis of traffic on the global Internet for the same period of time. The analysis is in Section 5.3.

4.4. DATA TRENDS BY SCIENCE DOMAIN

Science domain experts provided written descriptions of these trends. These contributions are available in Section 5.4.
4.5. CATALOG OF INTERNATIONAL BIG DATA SCIENCE PROGRAMS

A questionnaire was developed for interviewing science communities requiring international data exchange. The target science communities were identified by asking IRNC providers to identify science disciplines that produced the highest data volume both now and potentially in the future. In addition, scientists from large programs in those disciplines were asked to describe their current data collection and storage volume and needs, the resources utilized to transmit data, technical community interaction, and data projection strategies. Due to high variation in the quality and depth of responses, the study moved toward broader exploration of international big science programs via Internet searches, attendance at a variety of science domain community meetings, and responses to an on-line survey of NSF Principal Investigators. This survey (Appendix B) was designed to capture existing and planned international science collaborations, knowledge of new instrumentation, and the extent of international collaboration within NSF-funded programs. Using publicly available information from the NSF awards site, 30,897 PIs receiving NSF funding from FY2009 – July 2014 were invited to respond to the survey. A total of 4,050 persons responded, a 13% response rate.

At approximately the same time (summer 2013), Jim Williams at Indiana University and Internet2 began collecting "The International Big Science List" [57]. These efforts have been much expanded and placed within a framework for describing scientific data. Called the "Catalog of International Big Data Science", this interactive and updatable website is available at http://irnc.clemson.edu.

5. FINDINGS

5.1. FINDINGS: SURVEY OF NETWORK PROVIDERS

5.1.1. Current IRNC Infrastructure and Capacity

The IRNC production networks operate using industry-standard TCP/IP networking, and IRNC providers described their services as state-of-the-art. The experimental StarLight network supports the use of specialized high-performance protocols such as UDT [58]. The IRNC experimental network uses optical networking technology and protocols developed by GLIF. StarLight has more than ten years of experience with programmable networking in support of many international Grid projects. Approximately seven years ago, StarLight began investigating SDN/OpenFlow [59;60] technologies. Subsequently, the StarLight community worked with GENI to design and implement a nationwide distributed environment for network researchers based on SDN/OpenFlow, using a national mesoscale L2 network as a production facility. For over three years, with IRNC and GENI support, and with many international partners, StarLight has participated in the design, implementation, and operation of the world's most extensive international SDN/OpenFlow testbeds, with over 40 sites in North America, South America, Europe, and Asia. With support from GENI and the international network testbed community, a prototype Software Defined Networking Exchange (SDX) was designed and implemented at the StarLight Facility in November 2013 and used to demonstrate its potential to support international science projects. For over five years, the StarLight consortium has worked with the GLIF community and OGF to develop and implement an NSI Connection Service. More recently, StarLight has been supporting a project that is integrating NSI Connection Service 2.0 and SDN/OpenFlow.
All IRNC networks except TransLight/Pacific Wave have at least one 100Gbps network path; this matches the transition of campus, regional, and national R&E network providers to 100Gbps external network speeds. The high cost of crossing the Pacific Ocean (a cost factor of 5 higher than the Atlantic) presents a challenge. TransLight/Pacific Wave provides a 100Gbps path from Los Angeles to Hawaii, and 40Gbps total bandwidth to Australia, New Zealand, and elsewhere in Asia.

5.1.2. Current Top Application Drivers

Applications named by IRNC providers as currently having the top bandwidth or other demanding network requirements included:

- The Large Hadron Collider (Tier 1 transfer from CERN to Europe, the US, and Australia)
- Computational genomics
- Radio telescopes
- Computational astrophysics
- Climatology
- Nanotechnology
- Fusion energy data
- Light sources (synchrotrons)
- Astronomy

5.1.3. Expected 2020 Application Drivers

Looking forward to 2020, IRNC network providers expected application drivers to remain the same as in Section 5.1.2, with the addition of:

- More visualization
- Astronomy moving away from shipping tapes/drives toward near-real-time reaction to events, in order to verify an event and focus observation instruments
- Video, live and uncompressed
- Climate science and geology: collecting more LIDAR data as needed, combined with other data (e.g., earthquake monitoring and response)
- Larger sensor networks, especially in portions of the globe where there is currently no weather data being collected
- The Square Kilometer Array [61] being built in Australia and South Africa
- The catalog of life on this planet, which is growing larger and larger and will be stored in many locations. What is complicated is the coordination needed to become a single data set; some type of federated model is needed.
- Data that is currently concentrated in the US but will become more global in nature; look at where new telescopes are located and where population density is (China, India). Data will become global in terms of where it needs to go and where it will rest.

5.1.4. Interaction of Network Operators with Researchers

Network operators interacted most frequently with other people supporting R&E networks, and typically had infrequent interaction with scientists. "We are often surprised and discouraged by the overall lack of interaction between researchers/scientists and their network operators, sometimes within the same institution." (NSRC interview) The AmLight and StarLight programs were exceptions to this pattern, and each reported the most detailed knowledge of end-user applications.

5.1.5. Current Challenges

When asked to describe current challenges they face, the IRNC network providers identified the following:

1) Lack of wide-area network knowledge on campuses
IRNC providers' experience is that most network problems reported were caused by the end system; for example, an underpowered machine having inadequate memory, a slow hard disk, or a slow network card. A misconfiguration in the campus Local Area Network was another example. IRNC providers view the role of campus network staff as sitting at the edge of the web of regional, national, and international network connections; they are responsible for connecting their campus to this web.
In this role, campus network staff are an essential component of end-to-end support but, unfortunately, are frequently not knowledgeable about wide-area technologies and therefore cannot assist in end-to-end problem solving without the IRNC (or regional or national network) providers' help. Network path troubleshooting is still a very people-intensive process. The person who can do this needs a well-rounded skill set: he/she must understand end-user requirements, storage, computation, and the application as well as networking. There is room for automation here. IRNC providers would like to see more training for campus network staff in BGP, SDN, and wide-area networking.

2) Inadequate and uneven campus infrastructure
There are challenges getting the local network infrastructure (wiring/switches) ready to support an application. Campuses have multiple and perhaps conflicting demands for investment in the campus wiring and network electronics infrastructure. Getting wiring and electronics upgraded all the way to a specific end user's location in the heart of a campus may not be a high priority for the campus.

3) Poor coordination among network providers
The regional, national, and international web of network connections is not well coordinated. As a result, it can sometimes be difficult to identify the right person to contact during end-to-end troubleshooting, and there may be inefficiencies in the investment. The network path crosses multiple organizations; the hard part is figuring out which network segment has the issue, then working with individual researchers to fix the local system or network.

4) Interoperability challenges
IRNC providers face interoperability challenges in connecting 10Gbps and 100Gbps circuits, and in connecting optical networks and software-defined networks; the different implementations of SDN provide additional challenges.

5) Adoption of new network technologies
IRNC providers expect rapid growth in optical networking and SDN in the next 3-4 years. They are concerned that science communities do not appear to know anything about this yet. They are also concerned about whether it will be easy enough for end-users to use.

5.1.6. What are Future Needs of International Networks?

Most IRNC providers identified science communities' requirements as the best drivers for future network directions. In general, scientists are always pushing the frontiers of advanced networking and thus encounter new problems. The network needs to be thought of as a global resource. It is important to work collaboratively to coordinate and provide solutions that cross boundaries. Organizations that foster working together across international boundaries are needed. The NSF Exchange Point program and GLIF were mentioned as two examples that are working well.

West Coast providers were particularly concerned about funding. "The US government funds only circuits into the US, not the other way around. Other countries are now paying much more than the funding provided by NSF. And the focus on instruments of interest is shifting away from the US – e.g., the LHC and the Square Kilometer Array. The budget should shift by a factor of 10, particularly on the West Coast, where 10G across the Pacific is a factor of 5 higher cost than the Atlantic."

5.1.7. Exchange Point Program

The International Exchange Point program is seen by IRNC network operators as a success:

"It is terrifically important and significant and successful."

"I think it's been very important & will be more so, later.
More small countries are coming in with their own NRENs. Geological, climate, and genomics are requiring information coming in from all over."

"US exchange points are very important – without them, the US would not be as much in the center of this as it is. I worry about the long-term impacts of the global R&E community's reaction to the allegations about the NSA. There are some communities that are spending their own money to get to US facilities, but are talking about going elsewhere as a result of this. From a US perspective, we must support these exchange points – they are really, really important."

5.2. FINDINGS: NETWORK MEASUREMENT

Measuring the characteristics of network traffic is the foundation for understanding the types of applications (e.g., streaming video, large file transfer, email) on the network, the frequency of these applications, the number and location of users, network performance, etc. Network traffic monitoring, and the level of detail of any monitoring, vary significantly across IRNC network providers. Recent efforts to standardize measurement have focused on universal installation of the perfSONAR platform [62], which can be very useful in understanding actual network performance on each link of a network path when debugging end-to-end application issues. However, with the exception of the GLORIAD network monitoring, long-term performance measurement reporting is limited to bandwidth utilization.

A typical bandwidth utilization report is represented in Figure 8, showing average use of 40Gbps links into and out of the Pacific Wave traffic router during Q4 2013. Blue represents incoming traffic with an average utilization of 14.45Gbps; green represents outgoing traffic with an average utilization of 15.32Gbps. Each line in the graph is itself an average over the portion of the week represented. The bursty nature of network traffic is reflected in the shape of the graph; the spikes or peaks show that on occasion traffic can be double the average, thus the "headroom" requirement. It should be noted that the graph represents only the public portion of the Pacific Wave exchange and does not represent all of the traffic over the facility. There are private connections and CAVEwave [63] and CineGrid [64] traffic (part of the StarWave experimental network) that are not included in these numbers.

Figure 8. Quarterly Pacific Wave traffic for Oct, Nov, and Dec 2013

Figure 9 demonstrates the rapid rate at which scientists discover and utilize available tools. The graph was derived from 62 inbound Pacific Wave quarterly graphs covering the time period January 2006 - December 2013. The dark blue horizontal lines indicate link capacity; bandwidth was increased from 1Gbps to 10Gbps (2010) and then 40Gbps (2012). The light blue vertical bars represent each quarter's average throughput for that period of time, and the black 'T' shape above the quarter average indicates the peak throughput recorded for that period.

Figure 9. Example of IRNC network utilization over 8 years

5.2.1. GLORIAD's Insight Monitoring System

The GLORIAD "Insight" system [65] provides flexible, interactive exploration and analysis for all GLORIAD backbone traffic since GLORIAD's beginning as "MIRnet" in 1999. Insight is open-source software developed by GLORIAD in collaboration with the China Science and Technology Network (CSTnet) [66] and Korea's KISTI. Large IP flows are the units measured, and searchable information includes traffic volume, source/destination country, packet loss, source/destination by U.S.
state, traffic volume by application type or scientific discipline, network VLAN or AS number, or network protocol. Both live and historical data are available. Figure 10 summarizes all GLORIAD traffic from 2009-2013 by world region; Figure 11 provides an example snapshot of a packet loss incident. Total data stored to date comprises almost 2 billion records, with a million new records added each day.

Figure 10. GLORIAD large-flow total traffic 2009-2013, by world region

Packet loss is a measure indicating significant network congestion and/or interruptions in network service. Insight allows network operators to drill down into live traffic during a packet loss incident to problem solve. The drill-down path can follow any of the flow's recorded attributes, such as protocol, application, or institution.

Figure 11. Insight packet loss display example

Insight categorizes applications using well-known communication port numbers or certain well-known behaviors. Using GLORIAD's historical network records and Insight, it was possible to compare IP flow data for the year 2009 to the year 2013. In terms of total bytes per year, GLORIAD recorded 0.6 PB in 2009 and 3.7 PB in 2013, a cumulative growth factor of 5.7 (Figure 12).

Figure 12. Growth in GLORIAD total annual bandwidth over 5 years (annual top-10 large flows, in PB; growth factor 5.7)

Insight also reports the top applications in terms of total bytes for each application. Comparing 2009 to 2013 (Figure 13; strike-throughs in the labels show applications present in only one of the summaries), important changes include:

o The File Transfer Protocol (FTP) [67;68] is no longer a top-10 application; large data files are being moved by other applications such as the Aspera high-throughput file transfer software [69].
o IPv6 [70;71] is no longer being tunneled through IPv4 [72;73], so it is no longer listed as an application; this reflects significant adoption of IPv6, which is especially important to developing countries that arrived at the Internet after the IPv4 address space was exhausted.
o The appearance of HTTPS indicates a higher level of attention to securing communication with encryption.

Figure 13. Change in applications most frequently used on the GLORIAD network, comparing 2009 to 2013 (top-ten application labels include HTTP, HTTPS, Aspera, ROOTD, NICELink, SSH, COMMplex, HyPack Data Ac, Unidata LDM, FTP, Unlabeled, and Other TCP/UDP/IPv6)

Insight also displays a map of shaded geographic regions; each flow is labeled by country or U.S. state, and regions with more hits are shaded darker, indicating locations sending or receiving the greatest number of flows. For each year, the left-hand side maps source flows, while the right-hand side maps destinations. Comparing the year 2009 to 2013 (Figure 14) shows that, within the U.S., many EPSCoR jurisdictions [74] have fallen behind in building their R&E network capabilities, although some, like South Carolina and Alabama, have actually increased their use of the international R&E networks.

Figure 14. Total bandwidth plotted by U.S. state shows a drop in bandwidth from EPSCoR jurisdictions, probably due to lower investment in infrastructure upgrades. (Source location is on the left-hand side of each panel, destination location on the right.)
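As a concrete illustration of the port-based classification described above, the short sketch below labels flow records by well-known port and totals bytes per application. It is a minimal, hypothetical example: the port-to-application map and the flow-record fields are assumptions chosen for illustration, not the actual Insight schema or code.

```python
# Minimal sketch of classifying IP flow records by well-known port, in the spirit
# of the Insight approach described above. The PORT_APPS map and the flow-record
# fields below are illustrative assumptions, not the GLORIAD Insight schema.

from collections import defaultdict

PORT_APPS = {80: "HTTP", 443: "HTTPS", 22: "SSH", 21: "FTP",
             33001: "Aspera", 1094: "ROOTD", 388: "Unidata LDM"}

def classify(flow):
    """Label a flow by the better-known of its two ports, else 'Other (<proto>)'."""
    for port in (flow["dst_port"], flow["src_port"]):
        if port in PORT_APPS:
            return PORT_APPS[port]
    return f"Other ({flow['proto']})"

def top_applications(flows, n=10):
    """Aggregate bytes per application label and return the top n labels."""
    totals = defaultdict(int)
    for f in flows:
        totals[classify(f)] += f["bytes"]
    return sorted(totals.items(), key=lambda kv: kv[1], reverse=True)[:n]

# Example with two synthetic flow records
flows = [{"src_port": 53412, "dst_port": 443, "proto": "TCP", "bytes": 2_000_000},
         {"src_port": 33001, "dst_port": 60000, "proto": "TCP", "bytes": 9_000_000_000}]
print(top_applications(flows))
```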
5.2.2. The Angst over measurement and data sharing

At least two meetings have been held that included IRNC network providers and addressed the topics of IRNC measurement and IRNC cybersecurity; these meetings issued reports with recommendations, including "Security at the Cyberborder Workshop Report: Exploring the relationship of International Research network Connections and CyberSecurity" [75] and "International Research Network Connections: Usage and Value Measurement" [76]. At a January 2013 PIs meeting, each IRNC project summarized existing baseline measurement capabilities and described possible target measurement capabilities and services beyond the current IRNC program phase. The discussion emphasized the usefulness to funding agencies, users, and resource providers of the new Extreme Science and Engineering Discovery Environment (XSEDE) [77] resource utilization tool XDMoD [78], especially its support for "drill down" into finer detail. IRNC PIs mentioned a host of privacy and technical issues they consider obstacles to similar visibility into government-funded network resources. Transparency in network use was described as a complex partnership between NSF, grantee institutions, carriers, and international partners (although most of these were not represented in this conversation). Existing data sharing frameworks and display strategies could be put to work if the data were available and standardized. Both meetings agreed that a lowest common denominator for data sharing would be to passively capture and report on DNS packets. However, these recommendations have not yet been implemented by all participants.

5.3. IRNC NETWORK TRAFFIC COMPARED TO ESnet AND THE GLOBAL INTERNET

How does this analysis of GLORIAD's network traffic over the last 5 years compare to what is known about other research networks, or even commodity Internet traffic, over the same period of time?

5.3.1. Synopsis of data trends for the ESnet International Networks

FIGURE 15. ESNET PROJECTED LARGE-SCALE SCIENCE DATA TRANSFERS COMPARED TO HISTORICAL TRENDS

Author: Brian Tierney, Staff Scientist, ESnet Advanced Network Technologies Group, Lawrence Berkeley National Laboratory

The DoE operates ESnet, a network dedicated to DoE applications and requirements; ESnet also operates internationally. Because the set of applications supported is much smaller than that of the IRNC, and perhaps for other reasons, DoE has conducted application requirements workshops for its users and has published projected capacity requirements [79;80]. Based on data collected at the 2012 requirements workshops summarized in the report, ESnet estimates that traffic to Europe will continue to grow exponentially until at least 2022 (Figure 15).

ESnet Big Data Applications

The LHC, the most well-known high-energy physics collaboration, was a driving force in the deployment of high bandwidth connections in the research and education world. Early on, the LHC community understood the challenges presented by their extraordinary instrument in terms of data generation, distribution, and analysis. The LHC will be back online in early 2015, and it is expected to produce 2-5 times more data per year after that [81;82].

In climate science, researchers must analyze observational and simulation data sets located at facilities around the world. Climate data is expected to exceed 100 exabytes (1 exabyte = 1000 petabytes) by 2020.
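Projections like the one in Figure 15 extrapolate a historical exponential trend forward. The sketch below shows that kind of fixed-growth-factor extrapolation; the starting volume and the 1.55x/year factor are invented placeholders, not ESnet's actual figures.

    # Illustrative only: extrapolate annual traffic under a constant growth factor,
    # the same kind of historical-trend projection shown in Figure 15. The starting
    # volume and the 1.55x/year factor are placeholders, not ESnet's actual data.
    def project_traffic(start_pb_per_year, annual_growth, years):
        """Projected annual traffic volume for each year, starting from year 0."""
        return [start_pb_per_year * annual_growth ** n for n in range(years + 1)]

    # 1.55x per year compounds to roughly 5.7x over four years, similar to the
    # GLORIAD 2009-2013 growth factor reported earlier in this section.
    for year, volume in enumerate(project_traffic(10.0, 1.55, 6), start=2014):
        print(f"{year}: {volume:7.1f} PB/year")

The value of such an extrapolation depends entirely on whether the historical growth factor holds, which is why the requirements workshops cited above matter.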
New detectors being deployed at X-ray synchrotrons generate data at unprecedented resolution and refresh rates. The current generation of instruments can produce 300 or more megabytes per second, and the next generation will produce data volumes many times higher; in some cases, data rates will exceed DRAM bandwidth (see Footnote 3), and data will be preprocessed in real time with dedicated silicon [83]. Large-scale, data-intensive international science projects on the drawing board include the International Thermonuclear Experimental Reactor (ITER) and the SKA, a massive radio telescope that will generate as much or more data than the LHC. The Belle II experiment, based in Tsukuba, Japan, is part of a broad-based search for new physics involving over 400 physicists from 55 institutions across four continents. The first data from Belle II is expected in 2015, and data rates between Japan and the USA are expected to be about 19 Gbps in 2018 and 25 Gbps in 2021 [84].

ESnet collects traffic utilization data on all of its international links, but unfortunately does not currently have an easy way to aggregate that data into detailed overall summaries. Current utilization data is available at http://graphite.es.net/. ESnet also runs an Arbor Networks NetFlow analysis system that supports some analysis. Figure 16 is an example analysis, which seems to show that ESnet's international traffic has roughly doubled in the past three years. (Interestingly, IRNC traffic as represented by GLORIAD experienced a four-fold increase over the same period of time.)

Footnote 3: Memory (DRAM) bandwidth is the rate at which data can be read from or stored into a semiconductor memory by a processor. Exceeding this bandwidth means data is being produced faster than it can be stored.

FIGURE 16. ESNET REPORT FOR PERIOD NOV 2011 - MAR 2014

Data Transfer Bottlenecks

ESnet's "Science Engagement" team spends a lot of time helping scientists improve their data transfer capabilities. Based on their experience, performance bottlenecks usually fall into one or more of the following categories (usually more than one):

Hosts that are not properly tuned for high-latency networks
Using the wrong tool (e.g., scp instead of GridFTP)
Packet loss, which is caused by one of the following:
o Undetected dirty optics / bad fibers
o Underpowered firewalls
o Under-buffered switches/routers

Note that packet loss is rarely caused by networks that are oversubscribed. See ESnet's "Science DMZ" paper for more details [85].
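The first category above, hosts that are not tuned for high-latency networks, comes down to the TCP bandwidth-delay product: on a long international path, a host needs enough socket buffering to keep the full bandwidth-delay product of data in flight. A short illustrative calculation (the path numbers are hypothetical):

    # Illustrative calculation of the TCP bandwidth-delay product (BDP): the amount
    # of data "in flight" on a path, and hence the socket buffer a host needs to
    # keep a long, fat international link full. Path numbers are hypothetical.
    def bdp_bytes(bandwidth_bps, rtt_seconds):
        """Bandwidth-delay product in bytes."""
        return bandwidth_bps * rtt_seconds / 8

    # A 10 Gbps trans-Pacific path with ~150 ms round-trip time:
    print(f"BDP: {bdp_bytes(10e9, 0.150) / 1e6:.0f} MB of buffering needed")   # ~188 MB

    # A host left at a typical 4 MB default buffer caps out far below link speed:
    capped_bps = 4e6 * 8 / 0.150
    print(f"Throughput with a 4 MB buffer: {capped_bps / 1e6:.0f} Mbps")       # ~213 Mbps

The same arithmetic explains why untuned hosts that perform well on a campus LAN can struggle badly on international R&E paths.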
5.3.2. Global Internet Growth Rate

Cisco, a major IP router vendor, provides current forecasts of Internet growth rates called the Cisco Visual Networking Index (VNI) [86]. Other sources referenced in the literature [87;88] have not been updated since 2009. The Cisco VNI published in June 2010 predicted growth over the period 2009-2013 [89], which is the period of time covered in this study. Table 1 compares those predictions to actual global Internet growth and to the IRNC experience as exemplified by GLORIAD data.

Traffic growth – Cisco VNI 2010 prediction (2009-2014): global IP traffic will increase more than 4.3-fold. Actual global Internet growth 2009-2013: global IP traffic increased by a factor of 5. GLORIAD data 2009-2013: GLORIAD traffic increased by a factor of 5.7.

Highest-traffic regions – Cisco VNI 2010 prediction: by 2014, the highest IP-traffic generating regions will be North America, Asia Pacific, Western Europe, and Japan. GLORIAD data: GLORIAD's highest IP-traffic generating regions in 2013 were Asia Pacific, North America, Western Europe, and the Russian Federation (note: GLORIAD is only one of 5 IRNC network providers).

Growth drivers – Cisco VNI 2010 prediction: the primary growth driver will be video, exceeding 91% of the consumer market by 2014; business web-based video conferencing bandwidth will grow 180-fold from 2009-2014. Actual global Internet growth: the primary driver is video. GLORIAD data: data tagged as audio/video was less than .001% of total data, and 99% of the data was transported using TCP; an example "Big Data" flow is Aspera, a high-speed file transfer software. Flows labeled as audio/video grew by a factor of 200.

Peer-to-peer traffic – Cisco VNI 2010 prediction: for the first time since 2000, peer-to-peer (P2P) traffic will not be the largest Internet traffic type, being replaced by video. Actual global Internet growth: Internet video dominates global network traffic, and P2P is declining globally [90]. GLORIAD data: no data on P2P, but note the reduction in UDP traffic (Figure 13).

TABLE 1. COMPARING CISCO VNI 2010 PREDICTIONS TO ACTUAL GLOBAL INTERNET GROWTH AND ALSO TO GLORIAD HISTORICAL DATA

We can conclude from Table 1 that IRNC networks are growing at a rate similar to Cisco's predictions regarding the commodity Internet, but with the key difference that a primary growth driver is data. This distinction is no small matter; although GLORIAD constitutes only a tiny portion of global Internet traffic (3.2E-06 of the total), it must be engineered for scientific application patterns. R&E networks really do matter! Therefore, industry forecasts can be useful in planning IRNC network capacity and design, with attention paid to the unique distribution of applications.

Another important change in Internet traffic over the period 2009-2013 has been the migration of a significant amount of network traffic from global transit providers to cloud (content) providers. Traffic patterns between network administrative domains are evolving, with a significant shift in Internet inter-domain traffic demands and peering policies [90]. Beginning with a hierarchical model in which global transit providers interconnect smaller tier-2 and regional/tier-3 providers, an evolution has occurred where the majority of commercial inter-domain traffic by volume now flows directly between large content providers (eg: Google, YouTube), data centers, and consumer networks. The result is that the majority of traffic bypasses long-haul links, with content delivery and cloud service providers connecting directly to metro and regional backbones. This change in traffic patterns does not appear within R&E networks, probably due to slower adoption of both commercial and private cloud technologies in higher education research.

5.3.3. Industry Internet Growth Forecast for 2013-2018

Having demonstrated that industry forecasts are useful to the IRNC, it is valuable to reflect on key findings in the most recent predictive report (Cisco VNI). A graph of the predicted growth trend (Figure 17) shows growth to an expected global IP traffic of 1.6 zettabytes (one zettabyte is one billion terabytes) per year by the year 2018. To give some sense of the size of this number, the gigabyte equivalent would be all movies ever made crossing the global Internet every 3 minutes.
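Another way to grasp the scale: averaged over a full year, 1.6 zettabytes corresponds to roughly 400 terabits per second of sustained global traffic. The back-of-envelope arithmetic:

    # Back-of-envelope scale check: the forecast 1.6 zettabytes/year of global IP
    # traffic expressed as an average sustained rate.
    SECONDS_PER_YEAR = 365 * 24 * 3600
    annual_bytes = 1.6e21                       # 1.6 ZB (decimal)
    average_bps = annual_bytes * 8 / SECONDS_PER_YEAR
    print(f"Average global rate: {average_bps / 1e12:.0f} Tbps")   # ~406 Tbps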
FIGURE 17. ESTIMATED GROWTH IS FROM 0.75 ZETTABYTES TODAY TO 1.6 ZETTABYTES BY 2018

Key predictions in the Cisco report include:

The Internet grew fivefold from 2009-2014, largely due to the growth in the number and use of mobile devices such as smart phones and tablets. In the next three years (through 2018), traffic will grow at a threefold pace.
Busy-hour Internet traffic is growing more rapidly than average Internet traffic (growing at a rate of 3.4).
Content delivery networks will carry more than half of Internet traffic by 2018.
Traffic from wireless and mobile devices will exceed traffic from wired devices by 2018.
"The Internet of Things": the number of devices connected to IP networks will be nearly twice as high as the global population by 2018.
o Devices and connections are growing faster than both the world population and the number of Internet users.
o This trend comes from increasing numbers of devices per person/household, and also a growing number of machine-to-machine (M2M) applications.
o There will be nearly one M2M connection for each member of the global population by 2018.
Broadband speeds will nearly triple by 2018 (to 42 Mbps).
Globally, IP video traffic will be 79 percent of all IP traffic, for both business and consumer; all forms of video will be in the range of 80-90 percent of global consumer traffic by 2018.
Globally, mobile traffic will increase 11-fold between 2013 and 2018, reaching 191 exabytes annually by 2018.
By 2018, about half of all fixed and mobile devices and connections will be IPV6 capable.
IP traffic is growing fastest (compound annual growth rate of 38%) in the Middle East and Africa, followed by Asia Pacific.
By 2018, growth will reach 131.5 exabytes per month of additional traffic, distributed across regions as shown in Figure 18.

FIGURE 18. REGIONAL DISTRIBUTION OF NEW TRAFFIC IN 2018 (regions shown: Asia Pacific, North America, Western Europe, Central and Eastern Europe, Latin America, and the Middle East and Africa)

5.4. FINDINGS: DATA TRENDS

Science domain experts were invited to describe data trends in their respective fields, including the types of data, the growth and impact of international collaborations, and factors that might affect the amount of data produced by 2020, including new instrumentation, technology, or methods of analysis. Physics and climate science were described previously in the ESnet Big Data Applications discussion in Section 5.3.1.

5.4.1. Synopsis of Data Trends in Astronomy and Astrophysics

Author: M. L. Norman, Professor of Physics, University of California San Diego and Director, San Diego Supercomputer Center

Types of Data

Four international astronomy projects will drive international R&E network traffic in the early 2020s: LSST, ALMA, SKA, and JWST. The Large Synoptic Survey Telescope (LSST) will generate large (3.2 Gpixel) image files of the sky in six color bands, as well as a large object catalog. The Atacama Large Millimeter Array (ALMA) and the Square Kilometer Array (SKA) will produce raw radio interferometry data, which consists of UV-plane visibilities and autocorrelations for many frequencies. The James Webb Space Telescope (JWST) will generate multicolor images and spectra of individual objects in the infrared and sub-mm parts of the spectrum. Other types of observational astronomy data from smaller projects include images, spectra, spectral data cubes (2 space, 1 frequency), and time series data. Virtually all observational astronomy data is stored in Flexible Image Transport System (FITS) [91] format files, which is a self-describing portable data format.
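Because FITS is self-describing, a generic reader can inspect any FITS file without project-specific code. A minimal sketch using the open-source astropy library (the file name is a placeholder):

    # Minimal sketch of reading a self-describing FITS file with the open-source
    # astropy library. "example_image.fits" is a placeholder file name.
    from astropy.io import fits

    with fits.open("example_image.fits") as hdul:
        hdul.info()                      # list the header/data units (HDUs) in the file
        header = hdul[0].header          # metadata travels with the data
        data = hdul[0].data              # pixel data as a NumPy array (or None)
        print(header.get("TELESCOP"), header.get("DATE-OBS"))
        print("image shape:", None if data is None else data.shape)

This portability is part of what makes FITS archives straightforward to replicate and reuse across international partners.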
FIGURE 19. ESTIMATED ANNUAL DATA PRODUCTION FROM BIG SCIENCE PROJECTS BY 2020. SIZE OF DOT REPRESENTS PB OF DATA PRODUCED AT THAT SITE AND NEEDING TO BE MOVED ACROSS INTERNATIONAL BOUNDARIES. RED ARE INSTRUMENTS; ORANGE ARE DATA REPOSITORIES; GREEN IS DATA CONTROL TRANSMISSION; BLUE IS OTHER DATA.

Astrophysical and cosmological simulations produce field and particle data in 1, 2, or 3 dimensions. Data from such simulations is output as multivariate array data and particle lists whose records include spatial coordinates, velocity components, mass, and other particle attributes. Many simulations have adopted the portable HDF file format [92], which is a self-describing hierarchical data format, but others output application-specific binary files.

Trends from simulation to observation, or combinations of these

Simulation has two meanings in this context: (1) simulated observations, used to mock up observational data for the purposes of developing/assessing automated analysis pipelines; and (2) astrophysical simulations. The above-mentioned astronomy projects make extensive use of simulations of the first kind. In the run-up to full science operations, large amounts of simulated observations, comparable in size to actual data, are generated to test the readiness of the distributed data processing infrastructure. Increasingly, large cosmological simulations are being performed for the purpose of designing/optimizing observing strategies for measuring dark energy or probing the epoch of reionization.

Trends in international collaboration and/or international data sharing

All the projects mentioned above are international due to their high cost and one-of-a-kind nature. The main international partners for LSST are Chile, France, and the Czech Republic. The main international partners for ALMA are Chile, the European Southern Observatory, and Japan. The main international partners for SKA are Australia, Canada, China, Germany, Italy, New Zealand, South Africa, Sweden, the Netherlands, and the United Kingdom; the US is not participating at the present time. The main international partners for JWST are the European Space Agency and Canada. All these projects involve international data grids of some sort in which science data is replicated in archives operated by the member nations. The primary LSST archive will be at NCSA, University of Illinois Urbana-Champaign, and will be replicated in Chile. NCSA and Chile will also host DACs (science Data Access Centers). The primary ALMA data archive will be located at the National Radio Astronomy Observatory (NRAO) [93] in Charlottesville, VA, with satellite archives operated in France and Japan. The SKA instruments will be located in Australia and South Africa. Data from Australia will travel from Perth to London via the USA. Data from South Africa will travel directly to London, where the European distribution point is located. An LHC-style tiered data grid is envisioned to distribute the data from London to national Tier 1 data centers. Since Canada is a partner, it is expected to host such a Tier 1 center, which creates the possibility that SKA data will flow to Canada via the USA.

Numerical simulators are emulating their observational counterparts by deploying online "numerical observatories": Structured Query Language (SQL) [94] and NoSQL [95] databases that serve up raw and derived data products over the Internet. Three conspicuous examples are the Millennium Simulation database [96], the Bolshoi Simulation MultiDark database [97], and the Dark Sky Simulations database [98].
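These "numerical observatories" are, at heart, catalogs of derived products exposed through SQL queries. A self-contained sketch of that access pattern, using an in-memory toy halo catalog rather than any project's real schema:

    # Toy illustration of the "numerical observatory" access pattern: derived
    # simulation products (here a halo catalog) exposed through SQL. The table
    # layout and values are invented, not any project's actual schema.
    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE halos (halo_id INTEGER, redshift REAL, mass_msun REAL)")
    conn.executemany(
        "INSERT INTO halos VALUES (?, ?, ?)",
        [(1, 0.0, 1.2e14), (2, 0.0, 3.4e12), (3, 0.5, 8.9e13)],
    )

    # A typical remote query: massive halos at the present epoch.
    rows = conn.execute(
        "SELECT halo_id, mass_msun FROM halos WHERE redshift = 0.0 AND mass_msun > 1e13"
    ).fetchall()
    print(rows)    # [(1, 120000000000000.0)]

Serving queries like this over the Internet lets collaborators retrieve only the derived records they need rather than moving whole simulation outputs.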
New instruments, methods or use modalities that are scaling up current activities LSST has a 3.2 Gpixel CCD array camera that will photograph the night sky every 10 seconds, producing a dataflow of 15 TB of image data per night. With expected 10-year project duration, some 200 PB of data will be amassed by the end of the project. ALMA is a new large radio interferometer that is in operation now, and will continue through the next decade. ALMA data link speeds from Santiago, Chile to Charlottesville, VA will increase from ~100 Mb/s now to ~1 Gb/s by 2016, and then to 2.5 Gb/s by 2020 as more data-intensive observations are planned. SKA will produce an unprecedented data rate of ~15 Pb/s aggregate between the receivers and the central correlator. However none of this will touch international research networks. Processed science data is estimated to flow into the European distribution point at a sustained rate of ~100 Gb/s and thence onto the national Tier 1 data centers at 30 Gb/s each. No estimates are available for JWST data, but since this is a space observatory the primary telemetry data does not enter the US on the wired Internet. Heavy international access is anticipated to the MAST archive located at the Space Telescope Science Institute in Baltimore, MD. New technology that substantially reduces cost of collecting/retaining data. The National Center for Supercomputing Applications (NCSA) [99] has recently deployed the largest High Performance Storage System (HPSS) [100] tape archive for open science data. HPSS has added capability for Redundant Arrays of Independent Tapes (RAIT)—tape technology similar to RAID [101] for disk. RAIT dramatically reduces the total cost of ownership and energy use to store data without danger from single or dual points of failure through generated parity blocks. It also enhances the performance of data storage and retrieval since the data is stored and read/written in parallel. Important bottlenecks in creating, storing, or making better use of data The bottlenecks are myriad. Aside from the Blue Waters project [102] at NCSA, the NSF has not funded an XSEDE-wide High Performance Computing (HPC) data archive for data preservation. 35 As a consequence archives at many XSEDE Service Provider sites have aged out and have been decommissioned. Without a means to preserve data, there is no reason to create data access/data discovery facilities around the data. HPC data largely exists as single copy data resident on the Lustre file systems [103] of geographically-distributed HPC systems. Such data is not discoverable or web-accessible. Researchers are turning to cloud providers to address this problem [98], although it is unclear who will pay the costs for long-term data storage and retrieval. 5.4.2. Data Trends in Bioinformatics and Genomics Author: Stephen A. Goff, The iPlant Collaborative [104], BIO5 Institute, University of Arizona, Tucson AZ Genes, genomes, and traits have been studied and analyzed for over a century, beginning well before biologists agreed that DNA was the genetic material. Recent advancements in automated sequencing, high-throughput phenotyping, and molecular phenotyping technologies are rapidly bringing biology into the “big data” era. DNA Sequencing technology is the prime example of this advancing technology and is the gold standard of gene and genome identification and analysis. 
Types of data The types of data important for biologists include DNA sequence data, RNA sequence data, phenotypic or trait data, environmental data, ecological data, and phylogenetic data. New instruments, methods or use modalities that are scaling up current activities and new technology that substantially reduces cost of collecting/retaining data. DNA sequencing technology has decreased in cost approximately five orders of magnitude and consequently DNA sequence data is increasing faster than exponential at this time. Several thousand species have been sequenced (out of a few million species total) and the technology is now being applied to varieties of crops to empower molecular breeding. Rice, as an example, has over 200,000 known varieties and a few thousand rice lines have been sequenced and made publicly available. The raw data from these few thousand varieties requires approximately 20 terabytes of storage space and about a few tens of terabytes scratch disc space for analysis purposes. It is expected that hundreds to several thousand varieties of important crop plants will be sequenced in the near future and this will generate petabyte levels of raw data to be stored and analyzed. In addition to crop genomes and varieties, a few thousand plants from diverse groups of species have been or are being sequenced. These represent a small percentage of the estimated 500,000 green plant species, but cover a broad range of diversity. Likewise for animals, a broad range of diverse species has been or will be sequenced in the near future. Advanced and inexpensive DNA sequencing technology is also being used to study gene expression under different conditions. Instead of sequencing the genomes, RNA from expressed genes (both protein-coding and non-coding) is purified and converted to DNA then sequenced. This allows researchers to determine which genes are on in specific environmental conditions and how an organism responds to a changing environment by changes in gene expression. This technology is known as “RNA-Seq” and is the technology generating the highest amount of raw data with modern sequencers. A typical, well-designed RNA-Seq experiment will generate 1-2 36 terabytes of data and with the current estimates of new instruments in use today, hundreds to thousands of terabytes of raw data could be generated daily. RNA-Seq is an example of highthroughput “phenotyping”, where the phenotype under study is gene expression in a specific condition. This technology is mainly applied in the developed countries, but will soon be applied more globally as new sequencing instruments are delivered to dispersed locations. One major trend is for sequencing to be done locally at a large number of research institutes versus at primarily at a few large sequencing “centers”. In addition to genome sequencing and gene expression profiling, high-throughput sequencing is also being used to determine the offspring “genotype” created by breeding specific parental lines of crops or livestock. This use is designated “genotype by sequencing” and is a technology poised to create a large amount of data (petabyte levels, especially in the private sector) as molecular breeding technology gains momentum. In the public sector, this data is likely to be generated mainly by academic institutions focused on agriculture and livestock improvement as well as the USDA. 
It is beginning to be supported by humanitarian foundations interested in improving crops in developing countries where conventional plant and livestock breeding has not progressed as it has in the western world. In addition to RNA-Seq, there are several high-throughput phenotyping technologies that generate significant datasets but are only beginning to be adopted broadly. Perhaps the most obvious is in the field of imaging. Images from satellites are being used to detect pathogen spread through fields on a large scale. Images are also being used to study the growth, development and health of crops in specific fields. Microscopic images are being used to study gene expression over time in live cells, and responses to environmental changes. Use of image technology in biological research is expected to create petabyte levels of raw data. Medical images are one of the largest single datasets at approximately 100 petabytes. Laser images are being used to create three-dimensional patterns of growth in field crops (LIDAR for example) and a single pass over a typical sized breeding station plot (10 hectares) is estimated to generate 3-4 terabytes of raw data. There are thousands of such breeding station plots nationally, and to get an accurate assessment of growth over time, daily measurements would likely be taken. This data would need to move from dispersed field stations to a central analysis repository and would require reasonably broad bandwidth and compute for complete analysis. Another approach to phenotyping is “molecular phenotyping”. Each cell has several thousand small molecule “metabolites” that are constantly in a state of flux. These small molecules provide an indication of the health of the cell, the growth state, and the response to environmental perturbations. Analyzing these small molecules is designated “metabolite profiling” or “metabolomics” and consists of running several chemical extracts through mass-spectrometers followed by computational analysis of the resulting spectral patterns. The data create by metabolite profiling is currently fairly small (gigabytes per run), but has the potential of becoming quite large as the technology matures and is applied to much larger target species and conditions. It’s difficult to estimate the ultimate data size accurately at this point. Trends from simulation to observation 37 Biology is moving from observation toward simulation and modeling. Biological organisms are highly adapted, influenced by and tuned to environmental variables. In other words, specific organisms are very dynamic and to understand biological mechanisms it is necessary to consider and integrate environmental variables. The implication is that much more data will need to be collected over time and analyzed to fully understand and accurately simulate how organisms interact with the environment. Trends in international collaboration and international data sharing Larger and larger teams that cross international borders are doing biological science research. These collaborations share genotypic and phenotypic datasets that are growing increasingly large. Sharing terabyte levels of data is much more common today. Likewise, funding agencies are strongly encouraging the sharing of raw biological research data across international groups. 
Bottlenecks in creating, storing, or making better use of data The main bottleneck in data creation, storage and usage is sharing large datasets generated from dispersed research institutions, storing the data in consistent formats and annotating the data with standard metadata descriptions to make it the most useful for researchers who were not involved in the design of the experiments of creation of the raw data. 5.4.3. Data Trends in Earth, Ocean, and Space Sciences Author: Sara Graves, Director, Information Technology and Systems Center and Professor of Computer Science, University of Alabama in Huntsville The emerging "data rich" era in scientific research presents new challenges and opportunities for research. “Big data” can be described with four properties: volume (huge amounts data that is constantly growing), variety (many types of data), velocity (ability to be available and used quickly), and veracity (trusting data for decision-making). Data is fundamental to research and technologies to discover, access, integrate, manage, transform, analyze, visualize, store, share and curate data are core capabilities needed to exploit the wealth of available data. With a wealth of data, and new tools and technologies, scientists can explore new ways to gain new knowledge and solve problems that were previously not possible. Types of Data Geoscience researchers are working with a variety of globally distributed data, including field observations and sensor-based terrestrial and/or airborne observations, simulation results and experiment data. Many researchers work with derived data products, instead of primary data, for combining with data in their own analyses. As science becomes more interdisciplinary, scientists need data across multiple fields that are traditionally separate in the geosciences. In addition to traditional scientific data, scientists require new analytics tools to process, analyze, and derive knowledge from various structured and unstructured data, including text, audio/voice, imagery, video, and multimedia. Access to real-time data is becoming more important as computing capabilities improve and data becomes available. In addition to acquiring data in real-time, data can be transformed and delivered to a user based on the user’s needs, dynamically creating new, 38 virtual data sets. Real-time data processing can support decision-making in real-time, presenting new opportunities for a variety of applications with critical real-time requirements. Mobile computing is also growing, with scientists now taking advantage of having the capability to acquire or input data in the field. Trends from simulation to observation, or combinations of these While there are multiple communities in the geosciences, each with different research needs and styles, there are fundamental overarching, crosscutting needs that are common to each. EarthCube [105], an NSF program, is working towards the goal of transforming the conduct of research in geosciences and encouraging the community-guided cyberinfrastructure development of a coherent framework to integrate and use data and information for knowledge management across the entire research enterprise. NSF is engaging the geoscience community to develop a governance process to establish an organization and management structure to guide priorities and system design, and groups were established to address identified areas of need. 
Within EarthCube, the Data Discovery, Mining and Access (DDMA) group [106] was formed to address the data needs for geoscience research. The DDMA group conducted a series of virtual meetings with a diverse spectrum of the research community to exchange ideas, experiences and knowledge and to coordinate analysis and development of a roadmap that addresses data challenges for the geosciences community. The DDMA roadmap addresses current issues and needs concerning data access, discovery and mining for the geosciences. NASA’s Earth Science Data Systems Working Group (ESDSWG) [107] is focused on developing ideas from community inputs for working with NASA Earth science data. Currently, ESDWG is looking at a wide range of data issues, including cloud computing, collaborative environment, open source software, visualization, HDF5 conventions, preservation and provenance. Trends in international collaboration and/or international data sharing There are an increasing number of international collaborations concentrating on exploiting the use of science data. The International Council for Science: Committee on Data for Science and Technology (CODATA) [108] is an international organization aimed at strengthening science, with a particular emphasis on data management. Through working groups, conferences, workshops and publications, CODATA brings together scientists and engineers to address many of the issues related to data quality, accessibility, acquisition, management and analysis for a variety of scientific disciplines. The American Geophysical Union (AGU) [109], promoting discovery and collaboration in Earth and space science, brings together scientists from over 140 countries. With over 60,000 members, AGU hosts meetings in the spring and fall and publishes journals, books and articles to disseminate scientific information. The AGU Earth and Space Science Informatics (ESSI) [110] focus group addresses many issues related to data and information. Across the geosciences, common data standards often mentioned at AGU meetings include Open Geospatial Consortium (OGC) Standards [111], ESIP Federation Open Search [112], Open-source Project for a Network Data Access Protocol (OPeNDAP) [113], Metadata (FGDC, ISO 19115 [114]), File Formats (HDF, 39 NetCDF [115]). The Federation of Earth Science Information Partners (ESIP) [116] is a distributed community focused on Earth and environmental science data and information. Through a networked community of stakeholders, the ESIP Federation drives innovation by promoting collaboration to investigate issues concerning Earth science data interoperability. The Integrated Research on Disaster Risk (IRDR) program [117] is a multinational research effort investigating the challenges, preparedness, risk reduction and mitigation of natural disasters. IRDR is taking an interdisciplinary approach to apply science to identify risks of disasters, facilitate decision-making, and reduce risks from disasters across the world. New technology that substantially reduces cost of collecting/retaining data Cloud Computing can provide computing resources for research to those that may not have access to computing resources at their facility. In addition to cost savings and increased IT agility, cloud computing is currently being used to provide on-demand access for scientific research and also helps to address issues of sharing, scalability and management. Likewise, the cost of large, spacebased and airborne science platforms to acquire data can be prohibitive. 
Efforts are underway to provide smaller, less-expensive platforms to introduce more opportunities for research. Difficulties due to insufficient network infrastructure also remain problematic. While Open Data is desired for research both nationally and internationally, policies, rules and regulations can inhibit the production and use of Open Data. Charging for data, copyrights and patents forbidding re-use and proprietary technologies also hinder the use of Open Data. New instruments, methods or use modalities that are scaling up current activities The explosion of big data and advances in data technologies are introducing new and exciting opportunities in interdisciplinary research. Researchers are combining science data with data from social sciences, humanities, health and other fields to create new knowledge and advance discovery in ways previously unimagined. Advances in semantics-based technologies are needed to address the challenges associated with increasingly complex and distributed heterogeneous data. In addition to traditional science data, researchers are now also working image, geospatial, text and multimedia data. Data provenance is increasingly important as the complexity of data increases. Metadata is essential to the discovery and preservation of data, and research in metadata can help to address the challenges associated with data discovery, access and use. Scalable and interoperable annotation, query and analysis can serve to exploit data in new ways. The interactive exploration and visualization of large data sets with access to data is needed to accelerate development of new techniques of working with data and advance scientific discovery. Enabling effective collaborative research is also being explored. Collaborations can scale from individuals sharing science resources, to sharing within groups such as science mission teams, to sharing with an entire science community. New collaboration tools advance science by supporting sharing of data, tools and results, enhancing productivity and enabling new areas of research. Collaboration tools are needed to facilitate data search, access, analysis and visualization, as well 40 as to assist researchers with developing and sharing workflows, publishing results. Together, these tools improve research capabilities and increase opportunities for the discovery of new knowledge. Bottlenecks in creating, storing, or making better use of data The ability to integrate, understand and manage data is a primary objective. While there are numerous tools available, tools need to fit the way scientists perform their research. The scientist may need to discover new datasets that they may not have considered before. Gathering relevant data and information for case studies and climatology analysis is tedious and time consuming. The need exists for tools to filter through large volumes of online content and gather relevant information based on a user’s science needs. New methods of search include content-based discovery of disparate and globally distributed datasets. For instance, the design of current Earth Science data systems assumes researchers access data primarily by instrument or geophysical parameter. But case study analysis and climatology studies commonly used in Atmospheric Science research are instances where researchers are studying a significant event. Data organized around an event rather than the observing instruments can provide results more targeted to the scientist’s research focus. 
Usability is an issue frequently encountered by scientists. To be effective, tools need to be easy-to-use, efficient and effective. It is especially important to "lower the barrier to entry" for users when using a tool for the first time. In addition to lowering the barrier to entry, usability improvements to tools can increase productivity and allow for new, interdisciplinary science. The user interface continues to be an important area of research. For instance, a workbench environment from which a scientist could find a data set, subset it to get just the region of interest, use analysis tools on the data subset, and visualize and share the results could provide increased capabilities for research.

Much of the scientist's time is spent on data discovery, acquisition and preparation. While new tools and techniques are needed for working with data, the use of specialized tools and techniques is often outside the experience of scientists and can hinder the effective use of data. Easy-to-use, sustainable and reusable tools can enable scientists to overcome the complex challenges associated with diverse, distributed and heterogeneous data and spend more time focusing on their research. Validation of simulation data with observation data and in situ data is also needed.

Standards and interoperability play a fundamental role in working with data. The seamless interplay of data and tools can simplify work for the scientist and increase productivity. Data formats and communication protocols enabling the discovery, exchange and analysis of data are essential, but determining which formats should be standard continues to be a challenge.

With the increasing volume of data, I/O bottlenecks and scalability continue to be troublesome. Computing has been improved with multi-core processors, but access to attached data storage can be slow, and networks can introduce latency to data access times. Software such as Apache Hadoop, with modules for scheduling, parallel processing and distributed file systems, offers promise for complex, large-scale research applications dealing with resource constraints.

New drivers to international data sharing in a bottom-up way

Author: Beth Plale, Professor of Informatics and Computing and Director, Data to Insight Center, Indiana University

The Global Lake Ecological Observatory Network (GLEON) [118] is an example of an emerging source of international traffic. Ecologists who received tenure conducting studies and data gathering about a single lake or two formed an organization around the thesis that there was research benefit to collaboration among people studying lakes. As the 10 years of GLEON progressed, it became wildly successful in one important way: it created a generation of ecologists who received PhDs on cross-lake research, that is, research questions that involve the study of multiple lakes in diverse ecological settings, often across the world. The GLEON organization/community facilitated reduction of the social barriers to data sharing that previously hindered the larger-scale research. GLEON is a leading exemplar of larger science emerging from the organization and community building going on amongst smaller-science researchers. Taken as a whole, this is a trend that will continue. The Research Data Alliance (RDA) [119;120] is seeing interest from these smaller-scale science groups as a venue for international community building and for progress on technical issues in data interoperability.
New drivers to international data sharing are also occurring in a top-down way:

CoopEUS [121] is an EU/US jointly funded effort to facilitate international cross-disciplinary interoperability and data exchange. Lindsay Powers at NEON Inc. is the CoopEUS Scientist and represents EarthCube to the leadership council.

The Research Data Alliance (RDA) builds the social and technical bridges that enable open sharing of data. Beth Plale is an RDA Steering Group member.

The Group on Earth Observations Global Earth Observation System of Systems (GEO/GEOSS) [122-124] is a 'system of systems' that will proactively link together existing and planned observing systems around the world and support the development of new systems where gaps currently exist. It will promote common technical standards so that data from the thousands of different instruments can be combined into coherent data sets.

The International Council for Science (ICSU) World Data System [125] promotes universal and equitable access to, and long-term stewardship of, quality-assured scientific data and data services, products, and information covering a broad range of disciplines from the natural and social sciences, and humanities.

The Data Enabled Life Sciences Alliance (DELSA) [126] provides a leading voice and coordinating framework for collective innovation in data-enabled science for the life sciences community, and facilitates information exchange through workshop development, community involvement and publications.

5.4.4. Data Trends in Computer Science

Author: Geoffrey Fox, Professor of Informatics and Computing and Physics, Indiana University

Types of Application and Data

Computer Science covers a broad range of applications – some in domains outside computer science but often performed within computer science. One example of this is "Network Science" (sometimes called Web Science, Complex Systems, or just Social Networking), where researchers largely from social science, computer science and physics work. Further, much work is interdisciplinary, with computer science and application science researchers working together. For example, computer vision researchers could work with earth scientists on analysis of satellite data, with medical scientists on pathology, and with the CS artificial intelligence community on interpreting images for robots and, in particular, self-driving (autonomous) vehicles. This exemplifies the field of cyberphysical systems, which is growing in importance and is a major driver of network traffic. Many current computer science areas involve collaboration with industry in areas critical to today's digital world, including search, recommender engines, deep learning (artificial intelligence), robotics, human-centered interfaces, and social networking. In addition, today's cloud and enterprise data centers exhibit many distributed system and HPC research discoveries, while all of this is connected by networking that exploits new research technologies like OpenFlow and is protected by cybersecurity research. There is also substantial CS research on military applications that builds on the growing use of UAVs and other sensors and their analysis for command and control functions. Analysis of social media data is also of great interest to the intelligence community and is another example of CS-defense interactions.
The size of commercial data is known, with some useful numbers being ~6 zettabytes of shared digital data today; 1.8 billion photos uploaded every day to social media sites like Facebook [127], Instagram [128], WhatsApp [129] and Snapchat [130] (Flickr [131] is negligible); 24 billion devices on the Internet by 2020; annual IP traffic passing the one zettabyte/year mark in 2016 (18% business, the rest consumer); and YouTube [132] and Netflix [133] together accounting for around 50% of network traffic in the USA. The total data is growing roughly like Moore's law [134], but subcategories like the "Internet of Things" are growing much faster than that.

Details of Applications

More detail can be found in 51 use cases gathered by the National Institute of Standards and Technology (NIST) [135] in "Big Data Use Cases V1.0 Submission" [136] and "Draft NIST Big Data Interoperability Framework: Volume 3, Use Cases and General Requirements" [137]. These cover industry, government and academic applications and show many computer science areas that are presented in more detail than above. There are also many science domain applications in the NIST report, which are covered in other parts of this report. Direct computer science research is covered in NIST use cases 6-8 (commercial search and recommender systems), 13-15 (military), 16-25 (healthcare), 26-31 (deep learning and social media), 32 (metadata technology), and 10 and 49-51 (Internet of Things). The NIST report gives key characteristics of the data, with most current numbers smaller than a petabyte for research use. The commercial cloud sites are 6 zettabytes total, but other interesting large cases include medical diagnostic imagery (70 petabytes in 2011, going over an exabyte if cardiology is included) and military surveillance sensors gathering petabytes of data in a few hours. The NIST benchmark repository (use case 31) for data analytics is currently 30 TB but expected to increase. There are many Twitter [138] research projects, as it is possible to get good tweet samples (through Gnip [139]), whereas other commercial sites are less forthcoming. Use case 28 estimates that it collects ~100 million messages per day (~500 GB of data/day), increasing over time, for social media studies. This illustrates well the importance of streaming data. The content comes from people (as in search and social media) and from devices like the sensors in "smart" grids, vehicles, rivers, cities and homes. The development of wearables that supplement smart phones (Google Glass [140] and the Apple Watch [140] are visible examples; even the Samsung Galaxy S5 [141] already has 10 distinct sensors) will increase interest in research into the Internet of Things. Streaming applications like these are present in 80% of the NIST use cases and are clearly changing system software (Apache Storm [142] is a stalwart, but Google MillWheel [143] and Amazon Kinesis [144] are recent high-profile announcements) and network/computer use, and are spurring a re-examination of algorithms.

The scale of commercial problems is spurring a new type of research which sometimes gets bad press. With A/B testing [145], new algorithms are tested by switching a tiny fraction of, say, Netflix users to new software or a new algorithm. That group can be quickly compared with the rest of the world running the standard release. Successful trials are then incorporated into the basic release, and so the so-called perpetual beta software of cloud applications continues.
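A common way to implement such a split is deterministic, hash-based bucketing of users. The sketch below is illustrative only (the experiment name and the 1% fraction are invented), not a description of any particular provider's system:

    # Illustrative sketch of deterministic A/B assignment: hash each user id with
    # an experiment name and route a small, stable fraction to the new algorithm.
    # The experiment name and the 1% fraction are invented for illustration.
    import hashlib

    def in_treatment(user_id: str, experiment: str, fraction: float = 0.01) -> bool:
        """True if this user falls into the treatment bucket for this experiment."""
        digest = hashlib.sha256(f"{experiment}:{user_id}".encode()).hexdigest()
        bucket = int(digest[:8], 16) / 0xFFFFFFFF      # uniform value in [0, 1]
        return bucket < fraction

    users = [f"user{i}" for i in range(100_000)]
    treated = sum(in_treatment(u, "new-recommender-v2") for u in users)
    print(f"{treated} of {len(users)} users see the new algorithm")   # roughly 1,000

Because assignment is a pure function of the user id and experiment name, the same users stay in the treatment group for the life of the trial, which is what makes the before/after comparison meaningful.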
5.5. FINDINGS: ONLINE CATALOG OF INTERNATIONAL BIG DATA SCIENCE PROGRAMS

A survey of international big science programs was conducted; the resulting "Catalog of International Big Data Science Programs" is available as an interactive web site at http://irnc.clemson.edu. The site aggregates information gathered from interviews with network providers, domain scientists, the NSF PI survey, web site searches, and the I2 Big Science List. The web site provides information regarding over 100 international science collaborations involving Big Data and large numbers of scientists (50 or more) through visualization of locations and data size, and filtering by discipline or location. Below, a very few programs are described to help paint a picture of the special application requirements in scientific research that drive the need for R&E networks.

5.5.1. Large Hadron Collider: 15 PB/year (CERN)

The High Energy Physics (HEP) community has largely driven network backbone capacity to current levels – 40-100G+ (TB in Europe). HEP is the set of scientists who design, execute, and analyze experiments requiring use of the LHC facility at CERN. There is only one such place in the world, so those who want to participate in the experiments and results have, over more than 20 years, formed an international, well-organized community around these experiments. The computational people within the HEP community are computer scientists, programming physicists, experimental equipment designers/operators, and so forth. This community meets regularly at the International Conference on Computing in High Energy and Nuclear Physics (CHEP) [146].

The community shares the LHC instrument, located near Geneva and spanning the Franco-Swiss border, about 100 meters underground. It is a particle accelerator used by physicists to study the smallest known particles – the fundamental building blocks of all things. Major experimental groups are the ATLAS Experiment [147], the Compact Muon Solenoid (CMS) [148], and A Large Ion Collider Experiment (ALICE) [149]. The instrument is closed and will re-open in 2015. The community is a collaboration of 10,000 scientists and engineers from over 100 countries, organized around the six types of experiments at the LHC. Each experiment is distinct, characterized by its unique particle detector.

Because the LHC is a scarce resource, experiments are scheduled years in advance (well past 2020) and the community already knows how much data will be produced and distributed as a result of those experiments. The community has a well-established tiered model for data distribution, currently more of an inherited situation than a technical necessity. Until recently, CalTech was responsible for planning and implementing network connections for HEP. CERN (Tier 0) transfers data to "Tier 1" facilities (Brookhaven and Fermi National Labs); from there, data is re-distributed to "Tier 2" facilities, located at universities. The community is moving unevenly to more of a mesh; sites don't need to specialize in one kind of data/one kind of analysis. A MonaLisa [150] repository maintained at CalTech monitors all data flowing over all relevant backbone segments. Traffic graphs show that most data movement is now Tier 2 to Tier 2. HEP currently maintains ~30 PB of data (20 PB of which are online accessible); in comparison, Netflix has only 12 TB of data and comprises 30% of all traffic on the commercial Internet. HEP's online data is used at an average of 0.8 Tbps.
The community’s expectations are that a 5 minute delay is acceptable; bandwidth required is 2 Gbps average and 10 Gbps max. 1.1.1 The Daniel K. Inouye Solar Telescope (DKIST): 15 PB/year The Daniel K. Inouye Solar Telescope (DKIST, formerly the Advanced Technology Solar Telescope, ATST) [151] will address questions such as: What is the nature of solar magnetism; how does that magnetism control our star; and how can we model and predict its changing outputs that affect the Earth? The facility is located at Haleakala Observatories, Hawaii. Currently in construction, DKIST represents a collaboration of 22 U.S. institutions, reflecting a broad segment of the solar physics community. Depending on final design of the instruments, there will be 5-15 petabytes of data per year. For 45 their mission, latency and timeliness are not that important. This translates to just under 5 Gbps average over the course of a year needed to move the data from Maui to National Solar Observatory HQ in Boulder. 1.1.2 Dark Energy Survey (DES): (1 GB image, 400 images per night, instrument steering). The Dark Energy Survey (DES) [152] is designed to probe the origin of the accelerating universe and help uncover the nature of dark energy by measuring the 14-billion-year history of cosmic expansion with high precision. This collaboration is building an extremely sensitive 570Megapixel digital camera (DECam) and will mount it on the Blanco 4-meter telescope at Cerro Tololo Inter-American Observatory high in the Chilean Andes. Starting in Sept. 2012 and continuing for five years, DES will survey a large swath of the southern sky out to vast distances in order to provide new clues to this most fundamental of questions. More than 120 scientists from 23 institutions in the United States, Spain, the United Kingdom, Brazil, and Germany are working on the project. The DES-Brazil is a member of the international cooperation Dark Energy Survey (DES) led by Fermilab, NCSA and NOAO. Each DECam image is a gigabyte in size. The Dark Energy Survey will take about 400 of these extremely large images per night. This presents a very high data-collection rate for an astronomy experiment. The data are sent via a microwave link to La Serena. From there, an optical link forwards them to the National Center for Supercomputer Applications (NCSA) in Illinois for storage and "reduction". The project to date requires small file transfer and occasional large file transfers at speeds between 100Mpbs to 1 Gbps. Reduction consists of standard image corrections of the raw CCD information to remove instrumental signatures and artifacts and the joining of these images into 0.5 square degree "combined images". Then galaxies and stars in the images are identified, catalogued, and finally their properties measured and stored in a database. 1.1.3 Large Square Kilometer Array (Australia & South Africa) (100 PB per day) The Square Kilometer Array (SKA) will be the world’s largest and most sensitive radio telescope, addressing fundamental unanswered questions about our Universe including the first stars and galaxies formed after the Big Bang, how galaxies have evolved since then, the role of magnetism in the cosmos, the nature of gravity, and the search for life beyond Earth. The SKA is being built in Australia and South Africa. The total collecting area of the SKA will be one square kilometer, or 1,000,000 square meters. This will make the SKA the largest radio telescope array ever constructed, by some margin. 
5.5.4. Square Kilometer Array (SKA) (Australia & South Africa): 100 PB per day

The Square Kilometer Array (SKA) will be the world's largest and most sensitive radio telescope, addressing fundamental unanswered questions about our Universe including the first stars and galaxies formed after the Big Bang, how galaxies have evolved since then, the role of magnetism in the cosmos, the nature of gravity, and the search for life beyond Earth. The SKA is being built in Australia and South Africa. The total collecting area of the SKA will be one square kilometer, or 1,000,000 square meters. This will make the SKA the largest radio telescope array ever constructed, by some margin. To achieve this, the SKA will use several thousand dishes (high frequency) and many more low-frequency and mid-frequency aperture array telescopes, with the several thousand dishes each being 15 meters in diameter. Rather than just being clustered in the central core regions, the telescopes will be arranged in multiple spiral arm configurations, with the dishes extending to vast distances from the central cores, creating what is known as a long baseline interferometer array. The construction phase is 2016-2024, and prototype systems will begin operation in 2016. The instrument is expected to produce 100 petabytes of data per day. Collaborators are from neighboring African countries, Australia, New Zealand, Canada, China, Germany, Italy, the Netherlands, Sweden, and India.

5.6. FINDINGS: SURVEY OF NSF-FUNDED PIS

An on-line survey of NSF funded Principal Investigators whose projects' initial funding commenced in the years 2009-2014 was conducted in August 2014. A total of 4,050 PIs responded, representing a response rate of 13%, and over half of those who responded participate in one or more international science collaborations (Figure 20 A). The distribution of NSF program areas represented among the respondents is shown in Figure 20 B.

FIGURE 20. A MAJORITY OF NSF FUNDED INVESTIGATORS PARTICIPATE IN ONE OR MORE INTERNATIONAL COLLABORATIONS (panel A: share of respondents participating in international collaborations; panel B: respondents by NSF program area - Math and Physical Sciences, Biological Sciences, Social, Behavioral and Economic Sciences, Computer & Information Science, GeoScience, Engineering, and Education & Human Resources)

The complete survey questions can be found in Appendix B, and survey results can be found in Appendix C. Countries where shared data are created, processed, or stored (multiple countries could be selected by each respondent) are shown in Figure 21, which lists the top ten such countries. Of the 196 countries listed in the survey, 179 were named by respondents. Only 12% of respondents indicated that they had difficulty collaborating due to network limitations; for those who had this problem, the top countries mentioned are indicated in Figure 22.

FIGURE 21. TOP TEN COUNTRIES SOURCING DATA

FIGURE 22. COUNTRIES WHERE US INVESTIGATORS EXPERIENCE NETWORK INFRASTRUCTURE ISSUES

6. REFERENCES

[1] "Large Synoptic Survey Telescope (LSST)," 2015. http://www.lsst.org/lsst/
[2] "Atacama Large Millimeter/submillimeter Array (ALMA)," 2015. http://www.almaobservatory.org/
[3] "Square Kilometer Array (SKA Telescope)," 2015. https://www.skatelescope.org/
[4] "James Webb Space Telescope (JWST)," 2015. http://www.jwst.nasa.gov/
[5] "Large Hadron Collider (LHC)," 2015. http://home.web.cern.ch/topics/large-hadron-collider
[6] "International Thermonuclear Experimental Reactor (ITER)," 2015. http://www.iter.org/
[7] "Belle Detector Collaboration," 2015. http://belle2.kek.jp/detector.html
[8] National Oceanic and Atmospheric Administration, "What is LIDAR?," 2015. http://oceanservice.noaa.gov/facts/lidar.html
[9] "GLORIAD Insight," 2015. https://insight.gloriad.org/insight/
[10] "Domain Name System," 2015. http://en.wikipedia.org/wiki/Domain_Name_System
[11] P. Mockapetris, "Domain Names - Concepts and Facilities (IETF RFC 1034)," 1987. https://www.ietf.org/rfc/rfc1034.txt
[12] "Global Integrated Lambda Facility (GLIF)," 2014. http://www.glif.is/
[13] "Border Gateway Protocol 4 (BGP-4)," Y. Rekhter, T. Li, and S. Hares, Eds., 2006.
https://tools.ietf.org/html/rfc4271 [14] "Networking 101: Understanding BGP Routing," 2015. http://www.enterprisenetworkingplanet.com/netsp/article.php/3615896/Networking-101Understanding-BGP-Routing.htm [15] "Software Defined Networking (SDN)," 2013. http://en.wikipedia.org/wiki/Softwaredefined_networking [16] "Center for Applied Internet Data Analysis (CAIDA)," 2014. http://www.caida.org/home/ [17] "ACE/TransPac3," 2013. http://internationalnetworking.iu.edu/ACE [18] "Delivery of Advanced Network Technology for Europe (DANTE)," 2015. http://www.dante.net/Pages/default.aspx [19] "Trans-European Research and Education Networking Association (TERENA)," 2015. https://www.terena.org/ 49 [20] "New York State Education and Research Network (NYSERNet)," 2015. https://www.nysernet.org/ [21] "Internet2," 2014. http://www.internet2.edu/ [22] "AMLight," 2013. http://www.amlight.net/ [23] "Latin American Cooperation of Advanced Networks (RedCLARA)," 2015. http://www.redclara.net/index.php/en/ [24] "U.S. Department of Energy, Energy Sciences Network (ESnet)," 2015. http://www.es.net/ [25] "Canada's Advanced Research and Innovation Network (CANARIE)," 2015. http://www.canarie.ca/ [26] "GLORIAD ," 2013. http://www.gloriad.org/gloriaddrupal/ [27] "National Research Centre "Kurchatov Institute"," 2015. http://www.nrcki.ru/e/engl.html [28] "Korea Institute of Science and Technology Information (KISTI)," 2015. http://en.kisti.re.kr/ [29] "Chinese Academy of Sciences," 2015. http://english.cas.cn/ [30] "SURFnet," 2015. https://www.surf.nl/en/about-surf/subsidiaries/surfnet [31] "Nordic Infrastructure for Research and Education (NORDUnet)," 2015. https://www.nordu.net/ [32] "IceLink TransAtlantic Polar Network," 2015. https://ftp.nordu.net/ndnweb/nordunet___canarie_announces_icelink.html [33] "Hong Kong Open Exchange Portal (HKOEP) ," 2015. http://www.nren.nasa.gov/workshops/pdfs9/PanelD_OpticalTestbedsinChina-Haina.pdf [34] "Egyptian National Scientific and Technical Information Network (ENSTINET)," 2015. https://www.b2match.eu/scienceforsociety/participants/10 [35] "Telecomm Egypt (TEGROUP)," 2015. http://www.itu.int/net4/ITUD/CDS/SectorMembersPortal/index.asp?Name=42883 [36] "Tata Communications," 2015. http://www.tatacommunications.com/ [37] "National Knowledge Network (NKN)," 2015. http://www.nkn.in/ [38] "Singapore Advanced Research and Education Network (SingAREN)," 2015. http://www.singaren.net.sg/ [39] "Vietnam Research and Education Network (VinAREN)," 2015. http://en.vinaren.vn/ [40] "TransLight / Pacific Wave," 2013. http://www.hawaii.edu/tlpw/ [41] "TransPac," 2014. http://internationalnetworking.iu.edu/initiatives/transpac3/index.html 50 [42] "Asia Pacific Advanced Network (APAN)," 2015. http://www.apan.net/ [43] "TransLight/StarLight," 2014. http://www.startap.net/translight/pages/about-home.html [44] " TCP/IP Protocol Suite ," 2015. http://en.wikipedia.org/wiki/Internet_protocol_suite [45] Eli Dart, "The Science DMZ," 2015. http://www.internet2.edu/presentations/ccnie201404/20140501-dart-sciencedmz-philosophyv3.pdf [46] "Network Startup Resource Center," 2014. http://www.nsrc.org/ [47] "Network Service Interface (NSI)," 2015. https://redmine.ogf.org/projects/nsi-wg [48] "Open Grid Forum (OGF)," 2015. http://www.gridforum.org/ [49] "Global Environment for Network Innovations (GENI)," 2015. http://www.geni.net/ [50] "Advanced Layer 2 Services (ALS2)," 2015. 
http://www.internet2.edu/productsservices/advanced-networking/layer-2-services/layer-2-services-details/
[51] "Global Telecommunications System (GTS)," 2015. http://www.wmo.int/pages/prog/www/TEM/GTS/index_en.html
[52] "HiSeasNet," 2015. http://hiseasnet.ucsd.edu/
[53] "NASA Space Network," 2015. http://www.nasa.gov/directorates/heo/scan/services/networks/txt_sn.html
[54] "Hubble Space Telescope," 2015. http://hubblesite.org/
[55] "NASA's Earth Observing System," 2015. http://eospso.nasa.gov/
[56] "NASA International Space Station," 2015. http://eospso.nasa.gov/
[57] "Internet2 International Big Science List," 2015. https://www.internet2.edu/media/medialibrary/2014/01/22/The_International_Big_Science_List.pdf
[58] "UDP-Based Data Transfer (UDT)," 2015. http://udt.sourceforge.net/
[59] Nick McKeown, Tom Anderson, Hari Balakrishnan, Guru Parulkar, Larry Peterson, Jennifer Rexford, Scott Shenker, and Jonathan Turner, "OpenFlow: enabling innovation in campus networks," SIGCOMM Comput. Commun. Rev., vol. 38, no. 2, pp. 69-74, 2008.
[60] "OpenFlow," 2015. http://archive.openflow.org/wp/learnmore/
[61] "Square Kilometer Array (SKA)," 2015. https://www.skatelescope.org/
[62] "perfSONAR PERFormance Service Oriented Network monitoring ARchitecture," 2013. http://www.perfsonar.net/
[63] "CAVEwave," 2015. https://www.evl.uic.edu/entry.php?id=580
[64] "CineGrid," 2015. http://www.cinegrid.org/
[65] "GLORIAD "Insight"," 2015. https://insight.gloriad.org/insight/ (live and recent data); https://insight.gloriad.org/history/ (historical data)
[66] "China Science and Technology Network (CSTnet)," 2015. http://indico.cern.ch/event/199025/session/2/contribution/42/material/slides/0.pdf
[67] "File Transfer Protocol (FTP)," 2015. http://en.wikipedia.org/wiki/File_Transfer_Protocol
[68] J. Postel and J. Reynolds, "File Transfer Protocol (FTP)," 1985. https://www.ietf.org/rfc/rfc959.txt
[69] "Aspera," 2015. http://asperasoft.com/
[70] "Internet Protocol, Version 6 (IPv6) Specification," 2015. https://www.ietf.org/rfc/rfc2460.txt
[71] "IPv6," 2015. http://en.wikipedia.org/wiki/IPv6
[72] "IPv4," 2015. http://en.wikipedia.org/wiki/IPv4
[73] Information Sciences Institute, "Internet Protocol (IPv4)," J. Postel, Ed., 1981. https://tools.ietf.org/html/rfc1349
[74] "Experimental Program to Stimulate Competitive Research (EPSCoR)," 2015. http://www.nsf.gov/od/iia/programs/epscor/index.jsp
[75] Von Welch, Doug Pearson, Brian Tierney, and James Williams, "Security at the Cyberborder Workshop Report: Exploring the relationship of International Research Network Connections and CyberSecurity," Mar. 2012.
[76] KC Claffy and Josh Polterock, "International Research Network Connections: Usage and Value Measurement," May 2013.
[77] "XSEDE," 2013. https://www.xsede.org/
[78] "XDMoD: Comprehensive HPC System Management Tool," 2015.
[79] "Advanced Scientific Computing Research Network Requirements," Office of Advanced Scientific Computing Research, DOE Office of Science, Energy Sciences Network, LBNL report LBNL-6109E, Oct. 2012.
[80] William Johnston and Eli Dart, "ESnet capacity requirements: Projections based on historical traffic growth and on documented science requirements for future capacity," 2015.
[81] "High Energy Physics and Nuclear Physics Network Requirements - Final Report," LBNL-6642E, Aug. 2013.
[82] "BES (Basic Energy Sciences) Network Requirements Workshop, September 2010 - Final Report," 2010.
[83] "BER (Biological and Environmental Research) Network Requirements Workshop, April 2010 - Final Report," Apr. 2010.
[84] "Belle-II Experiment Network Requirements Workshop,"2012. [85] E.Dart, L.Rotman, B.Tierney, M.Hester, and J.Zurawski, "The Science DMZ: A network design pattern for data-intensive science," 2013. [86] "Cisco Visual Networking Index: Forecast and Methodology, 2013-2018," 2014. http://www.cisco.com/c/en/us/solutions/collateral/service-provider/ip-ngn-ip-next-generationnetwork/white_paper_c11-481360.pdf [87] A.Odlyzko, "Minnesota Internet Traffic Studies (MINTS) ," 2015. [88] A.Colby, "AT&T, NETC, Corning complete record-breaking fiber capacity test," 2009. http://news.soft32.com/att-nec-corning-complete-record-breaking-fiber-capacity-test_7372.html [89] "Annual Cisco Visual Networking Index Forecast Projects Global IP Traffic to Increase More Than Fourfold by 2014," 2010. http://newsroom.cisco.com/dlls/2010/prod_060210.html [90] Craig Labovitz, Scott Iekel-Johnson, Danny McPherson, Jon Oberheide, and Farnam Jahanian, "Internet inter-domain traffic," SIGCOMM Comput. Commun. Rev., vol. 40, no. 4, pp. 75-86, 2010. [91] "A Primer on the FITS Data Format," 2015. http://fits.gsfc.nasa.gov/fits_primer.html [92] "HDF5," 2015. http://www.hdfgroup.org/HDF5/ [93] "National Radio Astronomy Observatory (NRAO)," 2015. http://www.nrao.edu/ [94] "Structured Query Language (SQL)," 2013. http://en.wikipedia.org/wiki/SQL [95] "NoSQL," 2015. http://en.wikipedia.org/wiki/NoSQL [96] "Millenium Simulation Database," 2015. http://www.mpagarching.mpg.de/galform/millennium-II [97] "Bolshoi Simulation Multidark Database," 2015. http://www.multidark.org/MultiDark [98] "Dark Sky Simulations: Early Data Release," 2015. http://arxiv.org/abs/1407.2600 [99] "National Center for Supercomputing Applications (NCSA)," 2013. http://www.ncsa.illinois.edu/ [100] "High Performance Storage System (HPSS)," 2015. http://en.wikipedia.org/wiki/High_Performance_Storage_System [101] "Redundant Array of Inexpensive Disks (RAID)," 2015. http://en.wikipedia.org/wiki/RAID [102] "Blue Waters," 2015. http://www.ncsa.illinois.edu/enabling/bluewaters 53 [103] "Lustre File System," 2015. http://www.cse.buffalo.edu/faculty/tkosar/cse710/papers/lustrewhitepaper.pdf [104] "iPlant Collaborative," 2015. http://www.iplantcollaborative.org/ [105] "EarthCube," 2015. http://earthcube.org/ [106] "EarthCube Data Discovery Access and Mining (DDMA)," 2015. https://sites.google.com/site/earthcubeddma/ [107] "NASA Earth Science Data System Working Groups (ESDSWG)," 2015. https://earthdata.nasa.gov/esdswg [108] "International Council for Science : Committee on Data for Science and Technology (CODATA)," 2015. http://www.codata.org/ [109] "American Geophysical Union (AGU)," 2015. http://sites.agu.org/ [110] "American GeoPhysical Union Earth and Space Science Informatics Focus Group (AGU ESSI)," 2015. [111] "Open Geospatial Consortium Standards (OGC)," 2015. http://www.opengeospatial.org/standards [112] "ESIP Federated Open Search," 2015. http://wiki.esipfed.org/index.php/HowTo_Guide_for_Implementing_ESIP_Federated_Search_Servers [113] "Open-source Project for a Network Data Access Protocol (OPeNDAP)," 2015. http://www.opendap.org/ [114] "Federal Geographic Data Committee (FGDC) ISO 19115," 2015. http://www.fgdc.gov/metadata/geospatial-metadata-standards [115] "Network Common Data Format (NetCDF)," 2015. http://www.unidata.ucar.edu/software/netcdf/ [116] "Federation of Earth Science Information Partners (ESIP)," 2015. http://www.esipfed.org/ [117] "International Council for Science Integrated Research on Disaster Risk (IRDR)," 2015. 
http://www.icsu.org/what-we-do/interdisciplinary-bodies/irdr [118] "Global Lake Ecological Observatory Network (GLEON)," 2015. http://www.gleon.org/ [119] "Research Data Alliance (RDA)," 2015. https://rd-alliance.org/ [120] Fran Berman, Ross Wilkinson, and John Wood, "Building Global Infrastructure for Data Sharing and Exchange Through the Research Data Alliance,", 20(1/2) 2014. http://mirror.dlib.org/dlib/january14/01guest_editorial.print.html [121] "CoopEUS," 2015. http://www.coopeus.eu/ 54 [122] "Group On Earth Observations Global Earth Observation System of Systems (GEO/GEOSS)," 2015. http://www.earthobservations.org/geoss.php [123] M. L. Butterfield, J. S. Pearlman, and S. C. Vickroy, "A System-of-Systems Engineering GEOSS: Architectural Approach," Systems Journal, IEEE, vol. 2, no. 3, pp. 321-332, 2008. [124] Siri Jodha Singh Khalsa, Stefano Nativi, and Gary N.Geller, "The GEOSS Interoperability Process Pilot Project (IP3)," IEICE Transactions on Geoscience and Remote Sensing, vol. 47, no. 1 2009. [125] "International Council for Science (ICSU) World Data System," 2015. https://www.icsuwds.org/ [126] "Data Enabled Life Sciences Alliance (DELSA)," 2015. http://www.delsaglobal.org/ [127] "Facebook," 2015. http://www.facebook.com/ [128] "Instagram," 2015. http://instagram.com/ [129] "WhatsApp," 2015. https://www.whatsapp.com/ [130] "snapchat," 2015. https://www.snapchat.com/ [131] "Flickr," 2015. https://www.flickr.com/ [132] "YouTube," 2015. https://www.youtube.com/ [133] "Netflix," 2015. https://www.netflix.com/us/ [134] "Moore's Law," 2015. http://en.wikipedia.org/wiki/Moore%27s_law [135] "National Institute of Standards and Technology (NIST)," 2015. http://www.nist.gov/ [136] "NIST Big Data Program Working Group (NBD-PWG)," 2015. http://bigdatawg.nist.gov/usecases.php [137] U. C. a. R. S. NIST Big DataPublicWorking Group, "DRAFT NIST Big Data Interoperability Framework: Volume 3, Use Cases and Requirements," 2014. http://bigdatawg.nist.gov/_uploadfiles/BD_Vol3-UseCaseGenReqs_V1Draft_Pre-release.pdf [138] "twitter," 2015. https://twitter.com/?lang=en [139] "Gnip," 2015. https://gnip.com/ [140] "Google Glass," 2015. https://www.google.com/glass/start/ [141] "Samsung Galaxy S5," 2015. http://shop.sprint.com/mysprint/shop/phone_details.jsp?ensembleId=SPHG900BKS&flow=AAL &isDeeplinked=true&NoModal=true&defaultContractTerm=subsidy&ECID=SEM:Google:P:201 4_Q4_OEM:OEM_Samsung_Galaxy_S5_NonBrand:Galaxy_S5_Core_Exact:samsunggalaxys5: Exact&gclid=CIvmyrWykcMCFSgQ7Aod100AnQ&gclsrc=aw.ds 55 [142] "Apache Storm," 2015. https://storm.apache.org/ [143] Tyler Akidau, Alex Balikov, Kaya Bekiroglu, Slava Chemyak, Josh Haberman, Reuven Lax, Sam McVeety, Daniel Mill, Paul Nordstrom, and Sam Whittle, "MillWheel: Fault-Tolerant Stream Processing at Internet Scale," in Very Large Data Bases 2013, pp. 734-746. [144] "Amazon Kinesis," 2015. [145] "A/B Testing," 2015. http://en.wikipedia.org/wiki/A/B_testing [146] "International Conference on Computing in High Energy and Nuclear Physics (CHEP)," 2015. http://chep2015.kek.jp/about.html [147] "ATLAS Experiment," 2015. http://atlas.ch/ [148] "Compact Muon Solenoid (CMS)," 2015. http://home.web.cern.ch/about/experiments/cms [149] "A Large Ion Collider Experiment (ALICE)," 2015. http://aliceinfo.cern.ch/ [150] "MONitoring Agents using a Large Integrated Services Architecture (MonaLisa)," 2015. http://monalisa.caltech.edu/monalisa.htm [151] "Daniel K. Inouye Solar Telescope (DKIST)," 2015. http://dkist.nso.edu/ [152] "Dark Energy Survey (DES)," 2015. 
http://www.darkenergysurvey.org/

7. APPENDICES

7.1. APPENDIX A: INTERVIEW WITH NETWORK AND EXCHANGE POINT OPERATORS

Discussion Topics for IRNC PIs, Network Engineers, Administrators, Operators

Infrastructure:
1. Which network(s) do you support?
2. Do you have influence on the design/architecture of the network(s) you support?
3. Which networks do you peer with?
4. What are your preferred paths/networks?
5. What is the current network capacity?
6. Do you have plans to increase capacity?
7. What is your network redundancy plan?
8. What do you use to monitor your network?
9. Are there additional ways you would like to monitor your network?

Science Community Interaction:
10. What do you view as the current top three data drivers across your network?
11. What would you expect the top three data drivers to be in 2020?
12. What is the current average data flow across your network? What is the highest you have seen it spike to? How often and for how long do spikes occur?
13. Do large data sets typically originate on site and go out, or originate off site and come in? Or both? Do you expect this pattern to continue, reverse, or even out?
14. How do remote users access large data sets on your network? (What are the steps?)
15. How often do you interact with researchers/scientists regarding their network needs?
16. When interacting with researchers/scientists regarding their network needs, what is the most used mode of communication? (Face to face, email, phone, instant messenger, videoconference, etc.?)
17. Are the power users of your network from your organization/agency, or others?

Staff Support:
18. How many people support the network currently? Do you feel the current staffing levels are sufficient?
19. What is the most common troubleshooting issue you/your team deal with when large data sets are traversing the network, including across the WAN?
20. Do you interact with network engineers from other agencies/organizations about the network?
21. What critical skill sets will future network engineers need to possess?

Future Expectations:
22. For exchange point operators: What do you think about the effectiveness of the international exchange point program?
23. Do you envision any network technologies resurfacing and getting widely deployed? If so, please list them and explain why.
24. Do you envision data needs driving international networks to be more or less concentrated on dedicated end-to-end circuits?
25. What criteria do you feel should be used to project the future needs of international networks?
26. How do you think this will affect staffing needs in 2020?
27. What changes would you like to see across international networks in the next few years? Please list any areas that apply: technical, administrative, governance, etc.
28. Who needs to make these changes?

7.2. APPENDIX B: ON-LINE SURVEY FOR NSF-FUNDED PIS

International Collaborations in Science - Data Survey

Introduction

This survey is part of a National Science Foundation funded study (Award Number: 1223688) examining growth in the amount and types of data moving across the NSF-funded international Research & Education network infrastructure. The purpose of this survey is to gather information regarding where data is being produced and stored for international science collaborations. The time period of interest is from the present through 2020. Your response will help inform capacity planning for the National Science Foundation’s International Research Network Connections (IRNC) program. There are only 5 brief questions to answer. This material does not necessarily reflect the views of the National Science Foundation.

*Question 1: Do you participate in a project with international collaborators or using a science instrument located in another country?
( ) Yes   ( ) No

International Collaboration Project Information

Provide information about the collaboration or instrument, below. If you participate in more than one international collaboration, please describe the project having the largest amount of data. (Please include whatever information you have, and leave the rest blank.)
Project Name
Project URL
Approximate number of collaborators

*For the collaboration project/instrument described above, please check ALL countries where the data is created, processed, or stored. (Note - this question is about the locations of the data, not the location of every collaborator. Multiple choice is allowed.)

(Check-all-that-apply list of 196 countries, Afghanistan through Zimbabwe.)
New Instruments

*Question 2: Do your research plans involve the use of any new, large-scale scientific instrumentation that will begin or complete construction by the year 2020?
( ) Yes   ( ) No

New Instrument Information

Tell us about the new instrument. (Please include the information you know, and leave the rest blank.)
Name of instrument or project building the instrument
URL describing the instrument

*Please check ALL countries where the instrument will be located. (Same list of countries as Question 1.)

Other Impact Factors

*Question 3: Do you know of any activities or technology innovations that will significantly scale up the amount of data being collected, created or stored for use across international boundaries by the year 2020? (For example: dramatic reduction in cost of producing data, new centralization of previously scattered data, or other drivers.)
( ) Yes   ( ) No

If Yes, please describe the activity/innovation that will scale up the amount of data being produced: _________________________________________________________________________________________________________

Barriers to Collaboration

*Question 4: Are there currently any countries where Internet access is so limited that you are having difficulties collaborating with others?
( ) Yes   ( ) No

*Please select one or more countries where improved network connectivity would significantly impact your collaboration plans through 2020. (Same list of countries as Question 1.)

*Question 5: Please indicate your primary science knowledge domain
( ) Math and Physical Sciences
( ) Biological Sciences
( ) GeoScience
( ) Computer & Information Science
( ) Social, Behavioral and Economic Science
( ) Education & Human Resource Development
( ) Engineering

7.3. APPENDIX C: SUMMARY OF RESPONSES TO NSF PI SURVEY

An on-line survey of NSF-funded Principal Investigators whose projects’ initial funding commenced in years 2009-2014 was conducted in July and August 2014. A total of 4,050 PIs responded, representing a response rate of 13%.

QUESTION 1: Do you participate in a project with international collaborators or using a science instrument located in another country?

Yes - 56%   No - 44%

Those who answered “No” were immediately taken to the end of the survey. The remaining questions were answered by the 2,227 people who answered “Yes”.

Provide information about the collaboration or instrument, below. If you participate in more than one international collaboration, please describe the project having the largest amount of data.

This question was answered by 1,208 respondents, each listing only one project. Data collected included project name, URL, and approximate number of collaborators. Projects having 100 or more participants are included in the "Catalog of International Big Data Science Programs" web site at http://irnc.clemson.edu/

For the collaboration project/instrument described above, please check ALL countries where the data are created, processed, or stored. (Note - this question is about the locations of the data, not the location of every collaborator. Multiple choice is allowed.)
179 countries were mentioned by 1,667 respondents; respondents could select as many locations as they wanted. The list below shows the countries where data are created, processed, or stored that were mentioned by 50 or more respondents (country, count):

United States (606), Germany (330), United Kingdom (321), France (270), China (192), Canada (177), Japan (170), Australia (156), Italy (139), Brazil (127), Spain (125), India (114), Netherlands (103), Switzerland (92), Korea South (87), Israel (83), Sweden (77), Chile (72), New Zealand (72), Mexico (71), Russian Federation (71), Denmark (67), South Africa (62), Argentina (58), Belgium (58), Finland (58), Poland (57).

QUESTION 2: Do your research plans involve the use of any new, large-scale scientific instrumentation that will begin or complete construction by the year 2020?

This question was answered by 3,393 respondents: 12% responded YES, 88% responded NO. When asked for the name/URL of the new instrument, 237 persons responded. These responses are included in the "Catalog of International Big Data Science Programs" web site at http://irnc.clemson.edu/

When asked to indicate in which country the new instrument would be located, 381 persons responded, listing 89 countries. Countries mentioned by 10 or more respondents are listed below (country, count):

United States (237), Chile (38), Germany (30), France (29), United Kingdom (25), Japan (22), China (21), Australia (17), Switzerland (17), Canada (16), Italy (15), India (12).

QUESTION 3: Do you know of any activities or technology innovations that will significantly scale up the amount of data being collected, created or stored for use across international boundaries by the year 2020? (For example: dramatic reduction in cost of producing data, new centralization of previously scattered data, or other drivers.)

This question was answered by 3,286 respondents. Twenty-two percent (723 people) responded YES, and 707 provided some supporting text. The responses are summarized by primary scientific discipline as word clouds, below.

FIGURE 23. MATH & PHYSICAL SCIENCES
FIGURE 24. BIOLOGICAL SCIENCES
FIGURE 25. GEOSCIENCES
FIGURE 26. COMPUTER SCIENCES
FIGURE 27. SOCIAL AND BEHAVIORAL SCIENCES
FIGURE 28. EDUCATION, OUTREACH, AND HUMAN DEVELOPMENT

QUESTION 4: Are there currently any countries where Internet access is so limited that you are having difficulties collaborating with others?

A total of 3,272 persons responded, with 12% saying YES. Countries mentioned by 10 or more persons are listed below (country, count):

China (62), United States (29), India (26), Ghana (24), Kenya (19), Ethiopia (18), Tanzania (18), Nigeria (17), Brazil (14), Mexico (14), South Africa (14), Uganda (14), Congo {Democratic Rep} (13), Iran (12), Russian Federation (12), Zimbabwe (12), Cameroon (10).

QUESTION 5: Respondent’s Primary Science Domain (for people who answered yes to international collaboration)

This question was answered by 3,253 persons, with the distribution by science domain shown in the bar chart below (domains: Math and Physical Sciences, Biological Sciences, Social, Behavioral and Economic Science, Computer & Information Science, GeoScience, Engineering, Education & Human Resource Development).
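The headline tallies above rest on simple arithmetic; the sketch below is illustrative only (the variable names and the implied number of PIs surveyed are not stated in the report) and reproduces the reported shares from the raw counts quoted in this appendix.

    # Sanity-check arithmetic for the Appendix C survey tallies quoted above.
    responses = 4050          # PIs who responded
    response_rate = 0.13      # reported response rate
    yes_international = 2227  # respondents reporting international collaborations

    print("Implied survey population: ~%d PIs" % round(responses / response_rate))
    # ~31,000 invitations (derived estimate, not a figure stated in the report)
    print("International collaborators: %.0f%%" % (100 * yes_international / responses))
    # ~55%, consistent with the report's "over half" / 56% figure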
7.4. APPENDIX D: LIST OF PERSONS INTERVIEWED

Celeste Anderson, Director, External Networking Group, University of Southern California, August 22, 2013
Jaqueline Brown, Associate Vice President for Information Technology Partnerships at the University of Washington and Executive Director for International Partnerships for the Pacific Northwest Gigapop/Pacific Wave, August 2013
Maxine Brown, Associate Director of the Electronic Visualization Laboratory (EVL) at the University of Illinois at Chicago (UIC), August 13, 2013
K-C Claffy, Principal Investigator for the distributed Cooperative Association for Internet Data Analysis (CAIDA) and resident research scientist based at the University of California's San Diego Supercomputer Center
Andy Connolly, Professor of Astronomy, University of Washington (no date)
Julio Ibarra, Florida International University and AmLight Network, January 22, 2014
David Lassner, President, University of Hawaii
Tom Lehman, Director of Research, MidAtlantic Crossroads (MAX), August 14, 2013
Joe Mambretti, Director of the International Center for Advanced Internet Research at Northwestern University, August 22, 2013
Larry Mays, Chair, Department of BioInformatics, University of North Carolina-Charlotte, April 2013
Josh Polterock, Manager of Scientific Projects, Center for Applied Internet Data Analysis (CAIDA)
Tripti Sanha, MidAtlantic Crossroads (MAX), August 14, 2013
Dan Sturman, Ph.D., Engineering Director, Google, October 2013
John Silvester, Executive Director of the Center for Scholarly Technology and Professor of Computer Engineering, USC, August 2013

7.5. APPENDIX E: REPORTS USED FOR THIS STUDY

National Science Foundation (NSF) United States Antarctica Program (USAP) Science Workshop, 27 February 2012.
NASA Program Division (NPD) Civil and Commercial Operations (CCO) International Big Science List, Jim Williams, Di-Lu, and Ed Moynihan.
Cisco Visual Networking Index: 2013-2018.
NIST Big Data Interoperability Framework: Volume 3, Use Cases and General Requirements. Draft Version 1, April 2014, http://dx.doi.org/10.6028/ NIST.SP.XXX
SECURITY AT THE CYBER BORDER: EXPLORING CYBERSECURITY FOR INTERNATIONAL RESEARCH NETWORK CONNECTIONS. Welch, Von; Pearson, Douglas; Tierney, Brian; Williams, James.
K. C. Claffy and J. Polterock, "International Research Network Connections: Usage and Value Measurement," Tech. rep., Cooperative Association for Internet Data Analysis (CAIDA), May 2013.

7.6. APPENDIX F: SCIENTIFIC AND NETWORK COMMUNITY MEETINGS ATTENDED FOR REPORT INPUT

TERENA 2012 (Reykjavik, Iceland): Attended conference and met with several network engineers, exchange point operators, international R&E network operators, and network researchers.

CHEP 2012 (New York, New York): Attended conference and met with HEP (high energy physics) researchers. Organized a short BOF to introduce researchers to this project and request information for this report.

Internet2 Meetings (Joint Techs, TIP 2013, Member Meeting – various US locations): Attended meetings to learn more about projects involving international collaborations and large datasets. Interviewed IRNC PIs and network operators.

CANS 2012 (Seattle, WA): Met researchers, network operators, exchange point operators, and researchers working on collaborations between the US and China.

GLIF 2012 (Chicago, IL): Met with IRNC PIs in attendance and other operators and researchers working on data-intensive international collaborations. Noted upcoming network upgrades to current international connections and the applications driving those upgrades.
Introductions were made to additional points of contact for related projects.

Various site visits/interviews: Indiana University (ACE/TP3), UCSD (CAIDA.org, various researchers in HEP and Oceanography), SDSC (network operators, HPC specialists/scientists), NCAR (network operators, climatology researchers), ESnet (network engineers dealing with tuning and data-intensive networking projects), StarLight operators, CENIC operators, and PNWGP operators.