Report - Catalog of International Big Data Science Programs

Report on International Data Exchange Requirements
Authors: Jill Gemmill, Geoffrey Fox, Stephen Goff, Sara Graves, Mike L. Norman, Beth Plale, Brian Tierney
Report Advisory Committee
Jim Bottum, Clemson University
Bill Clebsch, Stanford University
Cliff Jacobs, Cliff Jacobs LLC
Geoffrey Fox, Indiana University
Stephen Goff, University of Arizona
Sara Graves, University of Alabama-Huntsville
Steve Huter, University of Oregon
Miron Livny, University of Wisconsin
Marla Meehl, UCAR
Ed Moynihan, Internet2
Mike Norman, San Diego Supercomputer Center
Beth Plale, Indiana University
Brian Tierney, ESnet
NSF Support ACI-1223688
December 2014
1. EXECUTIVE SUMMARY
The National Science Foundation (NSF) contributes to the support of U.S. international science
policy and provides core cyberinfrastructure through the International Research Network
Connections (IRNC) program. The annual investment in 2014 is estimated at $7M and
provides international network services supporting U.S. research, scholarship and collaboration.
Given the rapid growth of data driven science, new instrumentation, and multi-disciplinary
collaboration, NSF funded a study to estimate the flow of data over the international networks in
the year 2020. This estimation will inform the likely scale of network requirements and the level
of investment needed to support international network requirements for data exchange in the next
five years.
FIGURE 1. INTERACTIVE WEBSITE AT HTTP://IRNC.CLEMSON.EDU
Methods used to gather information to construct a likely scenario of future needs included
in-person interviews with IRNC providers, participation in science domain conferences, review of
available network utilization measurements, comparison to a commodity Internet prediction study,
a survey of NSF-funded principal investigators, input from science domain experts, on-line
searches, and advice from the study advisory committee.
In addition to this written report, the study also produced the “Catalog of International Big Data
Science”, an interactive and updatable website available at http://irnc.clemson.edu (Figure 1).
1.1. KEY OBSERVATIONS
1) The IRNC networks are demonstrably critical for scientific research, education, and
collaboration.
• NSF’s $7M annual investment is highly leveraged by a factor of 10-15 due to contributions
of other nations to the global R&E network infrastructure. (Section 3)
• The IRNC networks are distinguished from the commodity Internet in carrying extremely
large, data-driven traffic flows; the commodity Internet is 79-90% video driven. (Section 5.3.2)
• 56% of NSF-funded scientists (all disciplines funded by NSF) participate in international
collaborations. (Appendix C, Question 1)
• IRNC network capacity (bandwidth available) has been keeping pace with national R&E
backbone speeds and scientific requirements. (Sections 5.1.1 and 5.3.2)
• IRNC has demonstrated throughput (sustained gigabits per second) and quality of service
for certain applications that far exceed what is possible on the commodity Internet.
(Sections 3.2 and 3.3)
2) IRNC traffic in 2014 will triple by 2018.
• From 2009 to 2013, IRNC traffic is estimated to have grown by a factor of 5.7. This growth
is similar to or slightly higher than the growth of the global Internet over the same period
of time. (Figure 11 and Section 5.3.2)
  o The growth rate beyond 2018 may be even greater as “the Internet of Things”
  (e.g., sensor networks) develops. The number of devices connected to IP networks will be
  twice as high as the global population by 2018, accompanied by a growing number of
  machine-to-machine (M2M) applications. (Section 5.3.3)
  o A review of past Internet traffic data, as well as technology development in general,
  indicates that the growth trend has been and will remain exponential. (Section 5.3.3)
• Known scientific data drivers for the IRNC in 2020 will include:
  o Single site instruments serving: (Sections 5.1.3 and 5.3.1)
    a) the globe’s astronomers and astrophysicists, such as the Large Synoptic Survey
    Telescope (LSST) [1], the Atacama Large Millimeter/submillimeter Array (ALMA) [2],
    the Square Kilometer Array (SKA) [3], and the James Webb Space Telescope (JWST) [4];
    b) the globe’s physicists, such as the Large Hadron Collider (LHC) [5], the International
    Thermonuclear Experimental Reactor (ITER) [6], and the Belle detector collaborations [7].
  o Thousands of highly distributed genomics and bioinformatics sites, including: (Section 5.4.2)
    a) high throughput sequencing;
    b) medical images;
    c) integrated Light Detection and Ranging (LIDAR) [8].
    Each site will produce as much data as a single site telescope or detector. As these
    communities gain expertise in creating, analyzing and sharing such data, the number of
    extremely large data sets transiting networks will increase by three orders of magnitude (10^3).
  o Climate data, aggregated data sources (including sensor networks) and bottom-up
  collaborations will drive increased data for the global geoscience community.
  (Sections 5.3.1 and 5.4.3)
  o CISE researchers will be working with petabyte-sized research data sets to explore
  improved search/recommendation algorithms, deep learning and social media,
  machine-to-machine (M2M) applications, and military and health applications. (Section 5.4.4)
3) The IRNC program benefits from a collaborative set of network providers, but could use
better organization to maximize these benefits. (Section 5.1.5)
• A strength of this approach is the ability to try multiple approaches to a problem at the
same time and develop solutions that cross boundaries.
• Limitations of this approach are added complexity in the absence of a central network
operations center (NOC), inconsistent reporting on activities and network measurements, and
the absence of global reports.
4) There is limited network monitoring and measurement available among IRNC providers,
which makes it very difficult to assess link utilization beyond total bandwidth used.
(Section 5.2.2)
• There is high interest at NSF and among science domain researchers, among others, in use
of the network by discipline, by country of origin or destination, or by type of application.
However, such data is in general not readily available.
• A perceived host of legal, political, and cultural issues makes it difficult to address the
lack of monitoring. To date that discussion has been held mostly among network providers.
5) Most end-to-end performance issues, for IRNC and high performance R&E networks
(ESnet, Internet2 (I2), Regional Optical Networks (RONs), etc.), are due to problems in the
campus network, building, or end station equipment. (Sections 5.1.5 and 5.3.1)
6) Many EPSCoR jurisdictions have fallen behind in their participation in international
scientific data exchange.
• In 2009, EPSCoR jurisdictions had traffic on international R&E links that was comparable
to many other regions within the U.S. Current utilization by EPSCoR jurisdictions is
noticeably lower, reflecting uneven continued investment in regional and campus
infrastructures. (Section 5.2.1 and Figure 14)
7) The impact on IRNC activities of the trend toward cloud services and data centers as large
content providers is unknown.
• In the global Internet, traffic patterns have shifted from a hierarchical pattern, in which big
providers connect national and regional networks, to a pattern of direct connections
between large content providers, data centers, and consumer networks. The impact of
this transition on R&E networks is unknown. (Section 5.3.2)
1.2. RECOMMENDATIONS
The purpose of this report is to assist NSF in predicting the amount of scientific data that will be
moving across international R&E networks by 2020, and also to discover special characteristics of
that data such as time sensitivity or location. In addition, the study was to develop a method for
conducting this analysis that could be repeated in the future. The key findings, listed in section
1.1, show that there will be a continued exponential increase in international scientific data over
the next five years. The recommendations below are “low hanging fruit” that, if followed, will
best capture the opportunities and mitigate the current and future challenges of operating
international R&E networks supporting data-driven science.
1) Establish a consulting service or clearing house for information on the IRNC.
• The key service would be to facilitate discussions between scientists and network
engineers regarding the characteristics and requirements of their data. The Department
of Energy (DoE) and NSF Polar programs do this for their larger science programs. This
approach could build a bridge to increase scientific productivity.
• This service could be a function of the new IRNC Network Operations Center (NOC)
called for in NSF RFP-14-554. Alternatively, this service could be supported by making
experts available on retainer to those who need assistance.
• For large NSF programs, once past the pre-proposal selection, NSF could assign this
assistance to help at least large and medium scale science projects understand and plan
for their international network capacity requirements and impact.
• Domain specific workshops that include scientists, campus network staff, and backbone
provider staff could be held to dig into application requirement details and learn from
success stories; some of these are expected to result from the NSF CC*NIE and
CI*Engineer program awards.
2) Establish a single Network Operations Center for US international network providers so
that users and regional operators have a single place to contact.
• This service is likely to be a function of the new IRNC Network Operations Center
(NOC) called for in NSF RFP-14-554.
• This service would be a central point of contact for campus, regional, and national R&E
network operators and staff to reach international R&E networks, both in the U.S. and
elsewhere, regarding troubleshooting, special requirements, and other matters relevant to
optimal end-to-end connections.
• This service would report on the status of all international R&E links, such as
up/down state, current load, and service announcements.
• This service could provide uniform and comprehensive reporting on network traffic.
3) Establish global and uniform network measurement and reporting among IRNC
providers, including more detailed utilization information and historical reporting.
• Move the dialog on this topic out of a network operators-only context.
• Establish/adopt a standard metadata description for network traffic (e.g., the schema
developed for the GLORIAD Insight [9] monitoring system) to enable IRNC-wide
reporting and achieve common reporting to the extent that policy allows (a hypothetical
example of such a record is sketched below).
• Implement the measurement recommendations made at two or more IRNC network
monitoring meetings, i.e., begin with passive Domain Name Service (DNS) [10;11] record
reporting. Accessible packet loss reports are also of high interest.
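To make the metadata recommendation concrete, the fragment below sketches what one per-flow record in a common format might contain. The field names and values are invented for illustration only; they are not the GLORIAD Insight schema or any agreed IRNC standard.

```python
# Hypothetical example of a standardized per-flow metadata record.
# Field names and values are illustrative only; they are not taken from
# the GLORIAD Insight schema or any existing IRNC reporting standard.
example_flow_record = {
    "start_time": "2014-03-02T14:05:00Z",    # UTC timestamp
    "duration_s": 412,
    "src_country": "US",
    "dst_country": "KR",
    "application": "Aspera",                 # classified by port or behavior
    "discipline": "genomics",                # if known from the collaboration
    "bytes": 1.8e12,
    "packet_loss_pct": 0.02,
    "link": "TransPAC3",
}
```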
4) Continue to support collaborative coordination among network providers, within the US
and with external network partners. Foster organizations that build the community
working together across international boundaries.
• Successful examples include the Global Lambda Integrated Facility (GLIF) [12], which
works to develop an international optical network infrastructure for science by
identifying equipment, connection requirements, and necessary engineering functions
and services. Another example is the R&E Open Exchange Points that support bi-lateral
peering at all layers of the network.
5) Increase outreach and training to campus network staff in topics such as Border Gateway
Protocol (BGP) [13;14], Software Defined Networks (SDN) [15], wide-area networking,
how to debug last-100-feet issues, and how to talk with faculty about their application
requirements.
6) Address the uneven development of cyberinfrastructure; it is a barrier to collaboration.
• ACI and scientists in EPSCoR jurisdictions should work with the EPSCoR program to
address the growing network inequality gap.
• Continue the Network Startup Resource Center, which focuses on training for network
operators in countries whose IP traffic will grow most rapidly from now to 2020 – the
Middle East and Africa.
7) Focus on the following in engineering IRNC networks:
• Continue to facilitate the transfer of extremely large data sets/streams; an international
drop-box may be useful.
• Continue to push the envelope in supporting bi-directional audio/video at the highest
resolutions.
• Prepare for the “Internet of Things”: extreme quantities of relatively small data
transmissions (e.g., social media, sensors) that may have delivery delay requirements.
• Address “busy hour” traffic patterns, where average usage increases by 10-15%.
2. TABLE OF CONTENTS
1. EXECUTIVE SUMMARY
   1.1. Key Observations
   1.2. Recommendations
2. TABLE OF CONTENTS
3. INTRODUCTION TO THE IRNC NETWORK PROGRAM
   3.1. The IRNC Production Networks
      3.1.1. ACE (America Connects to Europe)
      3.1.2. AmLight (Americas Lightpaths)
      3.1.3. GLORIAD (Global Ring Network for Advanced Applications Development)
      3.1.4. TransLight/Pacific Wave (TL/PW)
      3.1.5. TransPAC3
   3.2. The IRNC Experimental Network
   3.3. Current IRNC Capacity and Infrastructure
   3.4. IRNC Exchange Points
   3.5. Emerging Networks
   3.6. Emerging Network Technologies
   3.7. Additional Networks for International Science
4. METHODS
   4.1. Survey of Network Providers
   4.2. Available Network Measurements
   4.3. IRNC Utilization Compared to ESnet and Global Internet Traffic
   4.4. Data Trends by Science Domain
   4.5. Catalog of International Big Data Science Programs
5. FINDINGS
   5.1. Findings: Survey of Network Providers
      5.1.1. Current IRNC Infrastructure and Capacity
      5.1.2. Current Top Application Drivers
      5.1.3. Expected 2020 Application Drivers
      5.1.4. Interaction of Network Operators with Researchers
      5.1.5. Current Challenges
      5.1.6. What are Future Needs of International Networks?
      5.1.7. Exchange Point Program
   5.2. Findings: Network Measurement
      5.2.1. GLORIAD's Insight Monitoring System
      5.2.2. The Angst over Measurement and Data Sharing
   5.3. IRNC Network Traffic Compared to ESnet and the Global Internet
      5.3.1. Synopsis of Data Trends for the ESnet International Networks
      5.3.2. Global Internet Growth Rate
      5.3.3. Industry Internet Growth Forecast for 2013-2018
   5.4. Findings: Data Trends
      5.4.1. Synopsis of Data Trends in Astronomy and Astrophysics
      5.4.2. Data Trends in Bioinformatics and Genomics
      5.4.3. Data Trends in Earth, Ocean, and Space Sciences
      5.4.4. Data Trends in Computer Science
   5.5. Findings: Online Catalog of International Big Data Science Programs
      5.5.1. Large Hadron Collider: 15 PB/year (CERN)
      5.5.2. The Daniel K. Inouye Solar Telescope (DKIST): 15 PB/year
      5.5.3. Dark Energy Survey (DES): (1 GB image, 400 images per night, instrument steering)
      5.5.4. Large Square Kilometer Array (Australia & South Africa) (100 PB per day)
   5.6. Findings: Survey of NSF-funded PIs
6. REFERENCES
7. APPENDICES
   7.1. Appendix A: Interview with Network and Exchange Point Operators
   7.2. Appendix B: On-Line Survey for NSF-funded PIs
   7.3. Appendix C: Summary of Responses to NSF PI Survey
   7.4. Appendix D: List of Persons Interviewed
   7.5. Appendix E: Reports Used for This Study
   7.6. Appendix F: Scientific and Network Community Meetings Attended for Report Input
3. INTRODUCTION TO THE IRNC NETWORK PROGRAM
The International Research Network Connections (IRNC) network providers operate Research
and Education (R&E) networks whose policies and network operational procedures are driven by
the needs of international research and education programs. IRNC leverages existing
commercial telecommunications providers’ investments in undersea communication cables and
university and Regional Optical Network (RON) expertise in operating regional and national R&E
networks. The U.S., through the NSF, invests approximately $7M/year in the IRNC program; this
modest investment is highly leveraged by a factor of 10 to 15 via international partner investments
supporting international R&E network links. IRNC networks are open to and used by the entire
U.S. research and education community and operate invisibly to the vast majority of users.
The IRNC networks support unique scientific and education application requirements that are not
met by services across commercial backbones. In addition, the IRNC network providers are
closely connected to researchers and treat meeting their needs and requirements as a primary
motivator. Examples of such requirements include hybrid network services, low latency and
real-time services, and end-to-end performance management. In this regard, the IRNC
extends the fabric of campus, regional, and national R&E networks across oceans and continents,
a web of connections built in collaboration with international partner R&E networks that serve the
growing number of international scientific collaborations.
The IRNC program was most recently funded for years 2009-2014; NSF is currently reviewing
responses to solicitation 14-554. The new awards will continue to provide production network
connections and services to link U.S. research networks with peer networks in other parts of the
world and leverage existing international network connectivity; support U.S. infrastructure and
innovation of open network exchange points; provide a centralized facility for R&E Network
Operation Centers (NOC) operations and innovation that will drive state-of-the-art capabilities;
stimulate the development, application and use of advanced network measurement capabilities and
services across international network paths; and support global R&E network engineering
community engagement and coordination.
3.1. THE IRNC PRODUCTION NETWORKS
IRNC network providers acquire, manage and operate network transport facilities across
international boundaries for shared scientific use. Network providers make arrangements with
owners of optical fiber, including undersea fiber cables, to use some portion of this installed
infrastructure, using equipment and management practices dedicated to R&E traffic. All shared
R&E international networks are funded by the NSF in cooperation with the governments of other
countries.
The independently managed networks exchange traffic at Exchange Points; there, operators can
implement bi-lateral policies to receive traffic from other networks and send traffic to other
networks; this includes passing traffic through network B so that network A can reach network C.
FIGURE 2. MAP OF THE IRNC NETWORKS 2014
Policies can derive from human policy, current traffic conditions, current costs, and so forth.
Exchange points are the focus for policy and technical coherence.
A map of the IRNC networks 2009-2014 is shown in Figure 2 (map from the Center for Applied
Internet Data Analysis (CAIDA) [16]). A current limitation of this overview map, and of some of
the following regional maps, is that they rely on manual updates to static files and are therefore
not likely to be current. Networks shown include five production networks (ACE, AmLight,
GLORIAD, TransPAC3, and TransLight/Pacific Wave) and one experimental network
(TransLight/StarLight).

3.1.1. ACE (America Connects to Europe)
FIGURE 3. AMERICA CONNECTS TO EUROPE NETWORK MAP
ACE (NSF Award #0962973) [17] is led by Jennifer Schopf of Indiana University, in partnership
with Delivery of Advanced Network Technology to Europe (DANTE) [18], Trans-European
Research and Education Networking Association (TERENA) [19], New York State Education and
Research Network (NYSERNet) [20], and Internet2 (I2) [21]. This project connects a community
of more than 34 national R&E networks in Europe.
3.1.2. AmLight (Americas Lightpaths)
AmLight (NSF Award #0963053) [22] is led by Julio Ibarra of Florida International University.
This program ties together the major research networks of Canada, Brazil, Chile, Mexico, and the
United States. In addition, this work enables interconnects between the United States and the Latin
American Cooperation of Advanced Networks (RedCLARA) [23] that connects eighteen Latin
American national R&E networks. The Atlantic Wave and Pacific Wave Exchange Points provide
peering for the North American backbone networks I2, the U.S. Department of Energy’s Energy
Sciences Network (ESnet) [24], and Canada's Advanced Research and Innovation Network
(CANARIE) [25].

FIGURE 4. AMERICAS LIGHTPATH NETWORK MAP
3.1.3. GLORIAD (Global Ring Network for Advanced Applications Development)
GLORIAD (NSF Award #0441102) [26] is led by Greg Cole at the University of Tennessee,
Knoxville. It includes cooperative partnerships in Russia (Kurchatov Institute) [27], Korea
(Korea Institute of Science and Technology Information, KISTI) [28], China (Chinese Academy
of Sciences) [29], the Netherlands (SURFnet) [30], the Nordic countries (NORDUnet [31] and
IceLink [32]), Canada (CANARIE), and the Hong Kong Open Exchange Portal (HKOEP) [33].
In addition, new partnerships are being developed with Egypt (ENSTINet [34] and Telecomm
Egypt [35]), India (Tata Communications [36] and the National Knowledge Network) [37],
Singapore (SingAREN [38]), and Vietnam (VinAREN [39]).
FIGURE 5. GLORIAD NETWORK MAP
3.1.4. TransLight/Pacific Wave (TL/PW)
FIGURE 6. TRANSLIGHT/PACIFIC WAVE NETWORK MAP
TL/PW (NSF Award #0962931) [40] is led by David Lassner, University of Hawaii. TL/PW
presents a unified connectivity face toward the West for all U.S. R&E networks including I2 and
Federal agency networks, enabling general and specific peerings with more than 15 international
R&E links. This project not only provides a connection for Australia's R&E networking
community but also provides connectivity for the world's premier setting for astronomical
observatories, the summit of Mauna Kea on the Big Island of Hawaii. The Mauna Kea
observatories comprise over $1 billion of international investment by 13 countries in some of the
most important cyberinfrastructure resources in the world.
3.1.5. TransPAC3
TransPAC3 (NSF Award #0962968) [41] is led by Jen Schopf at Indiana University. The R&E
networks included in the TransPAC3 collaboration cover all of Asia excluding only North Korea,
Brunei, Myanmar, and Mongolia. TransPAC3 collaborates with the Asia Pacific Advanced
Network (APAN) [42], DANTE, Internet2, and other R&E networks.
FIGURE 7. TRANSPAC3 NETWORK MAP
3.2. THE IRNC EXPERIMENTAL NETWORK
TransLight/Starlight (NSF Award #0962997) [43] is led by Tom DeFanti at the University of
California, San Diego. The award provides two connections between the U.S. and Europe for
production science: a routed connection that connects the pan-European GEANT2 to the U.S. I2
and ESnet networks, and a switched connection that is part of the LambdaGrid fabric being created
by participants of the GLIF. StarLight is a multi-100Gb/s exchange facility, peering with 130
separate R&E networks.
This network is unique among IRNC networks in that it is entirely dedicated to research traffic
and carries no educational commodity-type traffic (email, web pages, etc.). TransLight uses optical
networking, a means of communication that encodes signals onto light, which can operate at
distances from local to transoceanic and is capable of extremely high bandwidth. Optical
networking can be used to partition optical fiber segments so that traffic is entirely segregated,
with different policies or techniques applied to each segment. In collaboration with GLIF partners,
TransLight has been able to provide high bandwidth, low latency performance for interactive
high-definition visualization and other types of demanding applications. A limitation is that
scheduling is required.
3.3. CURRENT IRNC CAPACITY AND INFRASTRUCTURE
To provide context for the description of IRNC capacity, some background information on
network engineering is helpful. Traditional Internet networking is based on the TCP/IP protocol
suite [44]. TCP/IP is designed to be a best-effort delivery service; there may be significant
variation in the amount of time it takes a data packet to be delivered and in the amount of delay
between packets. Greater congestion in the network results in greater performance variation,
including the possibility of delivery failure, especially in the case of very large files. As a general
practice, internet providers achieve the desired network performance by arranging for an
abundance of bandwidth; a network that operates at 50% capacity is considered well engineered
since it has “headroom” to accommodate any sudden bursts of traffic. TCP/IP has built-in
congestion algorithms that provide equitable use of the bandwidth based on current traffic
conditions; this means end-users can use the Internet at their convenience without scheduling or
“busy signals”. A consequence of this approach is that measures of throughput, latency and jitter
for identical data traveling the same geographic path can vary significantly due to other traffic on
the network when the measurement takes place.
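As an informal aside (not from the report), the sketch below applies the widely cited Mathis et al. TCP throughput estimate to a hypothetical long, lossy path; the RTT and loss values are assumptions chosen only to illustrate why congestion and distance matter so much for very large transfers.

```python
# Back-of-the-envelope TCP throughput estimate (Mathis et al. model).
# Illustrative only: the RTT and loss values below are hypothetical,
# not measurements from any IRNC link.
from math import sqrt

def mathis_throughput_bps(mss_bytes: int, rtt_s: float, loss_rate: float) -> float:
    """Approximate steady-state throughput of one standard TCP flow, in bits/s."""
    return (mss_bytes * 8 / rtt_s) * (1.22 / sqrt(loss_rate))

mss = 1460          # typical TCP maximum segment size, bytes
rtt = 0.150         # assumed trans-Pacific round-trip time, seconds
for loss in (1e-5, 1e-4, 1e-3):
    gbps = mathis_throughput_bps(mss, rtt, loss) / 1e9
    print(f"loss {loss:.0e}: ~{gbps:.3f} Gbps per TCP flow")
# Even 0.1% loss limits a single standard TCP flow on this path to a few Mbps,
# which is why uncongested "headroom" and loss-free paths matter for big data.
```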
The IRNC production networks have been engineered to provide an abundance of bandwidth. The
IRNC experimental network, in contrast, is engineered to deliberately use 100% of bandwidth,
continuously; this design is possible because this network allows only pre-authorized traffic and
makes direct use of the underlying optical network. Via optical networking, the StarWave
experimental network can provide single 100Gbps sustained transfers over long periods of time, as
well as multiple sustained 1Gbps/10Gbps flows. End-to-end network configuration and
scheduling is now accomplished in an automated manner.
In 2014, all networks except TransLight/Pacific Wave have at least one 100Gbps network path;
the exception is due to the high cost of fiber crossing the Pacific Ocean (a factor of 5 higher than
the Atlantic). TransLight/Pacific Wave provides 40Gbps of total bandwidth to Australia and
elsewhere in Asia. In 2014, a new 40Gbps direct route to New Zealand was established. The
Pacific routes are expected to be upgraded to 100Gbps in 2016.¹ All networks have redundant
paths across oceans, except for the trans-Pacific connection, due to cost.
These capacities are comparable to regional and national backbone speeds and to the I2
Innovation Platform. Campuses/facilities that have joined the I2 Science DMZ (“Demilitarized
Zone”²) [45] have access to an SDN-enabled, firewall-free, 100Gbps network path that allows them
to experience the highest level of end-to-end performance. Thus, network performance across
oceans and/or continents should ideally be affected mostly by the distance involved and not by
network bottlenecks in the path.
¹ Conversation with David Lassner, September 2014.
² Developed by ESnet engineers, the Science DMZ model addresses common network performance problems
encountered at research institutions by creating an environment that is tailored to the needs of high
performance science applications, including high-volume bulk data transfer, remote experiment control, and
data visualization.
3.4. IRNC EXCHANGE POINTS
Network exchange points for research and education flows have served a pivotal role over the last
20 years in extending network connectivity internationally, providing regional R&E networking
leadership, and supporting experimental networking. Through years of operational experience
combined with international peering relationships, engineering activities, and international
networking forums, a set of guiding principles have emerged for successful approaches to an open
exchange point.
Exchange points support the homing of multiple international links and provide high capacity
connectivity to I2 and ESnet. They also provide maximum flexibility in connectivity and peering,
for example services at multiple layers of the network.
3.5. EMERGING NETWORKS
The Network Startup Resource Center (NSRC) [46] develops and enhances network infrastructure
for collaborative research, education, and international partnerships, while promoting teaching and
training via the transfer of technology. This IRNC project focuses NSRC activities on cultivating
cyberinfrastructure via technical exchange, engineering assistance, training, conveyance of
networking equipment and technical reference materials, and related activities to promote network
technology adoption, and enhanced connectivity in R&E sites around the world. The end goal is
to enhance and enable international collaboration via the Internet between U.S. scientists and
collaborators in developing countries. Active progress has occurred in National Research and
Education Networks (NRENs) and Research Education Networks (RENs) in Southeast Asia,
Africa and the Caribbean; this work will continue through NSF award #1451045 in the amount
of $3.7M to Steve Huter, Dale Smith and Bill Allen for “IRNC: ENgage: Building Network
Expertise and Capacity for International Science Collaboration” starting October 1, 2014.
3.6. EMERGING NETWORK TECHNOLOGIES
The StarLight experimental IRNC has had extensive experience with SDN and has supported
multiple international demonstrations and research projects in this area. IRNC funded networks
have also participated in the GLIF community that has developed the Network Service Interface
(NSI) standard [47] through the Open Grid Forum (OGF) [48]. NSI describes a standardized
interface for use at optical network exchange points, providing a foundation for automated
scheduling and deploying of optical circuits across provider and technology boundaries. The
production IRNCs have not yet deployed SDN, with the exception of some work begun in AmLight
in summer 2014, leveraging accomplishments of the Global Environment for Network Innovations
(GENI) program [49], I2’s Advanced Layer 2 Services (ALS2) [50] configuration tool, and some
GENI funded work in Brazil.
3.7. ADDITIONAL NETWORKS FOR INTERNATIONAL SCIENCE
In the past, research conducted at the North and South Poles, on board ships, in space, or using
distributed sensors relied on workflows where data was stored on site at or in the instrument and
then manually transported on some schedule to an analysis center. Due to the rapidly growing
satellite and other non-terrestrial telecommunications infrastructure, workflows are shifting from
periodic and manual to near real-time, using the Internet.
Whether manual or by Internet, the data at some point becomes connected to the national and
international research networks where data sharing and collaboration occur. In addition to the
NSF-funded IRNC networks, international science relies on shipboard, satellite and space
networks to capture and forward data. Examples of these networks include:
• The Global Telecommunications System (GTS) [51], a global network for the transmission
of meteorological data from weather stations, satellites and numerical weather prediction
centers.
• HiSeasNet [52], a satellite communications network designed specifically to provide
continuous Internet connectivity for oceanographic research ships and platforms.
HiSeasNet plans to provide real-time transmission of data to shore-side collaborators;
basic communications, including videoconferencing; and tools for real-time classroom
and other outreach activities.
• The NASA Space Network [53], consisting of the on-orbit Tracking and Data Relay
Satellite (TDRS) telecommunications satellites, placed in geosynchronous orbit, and the
associated TDRS ground stations, located in White Sands, New Mexico and Guam. The
TDRS constellation is capable of providing nearly continuous high bandwidth (S, Ku,
and Ka band) telecommunications services for space research, including the Hubble
Space Telescope [54], the Earth Observing Fleet [55] and the International Space Station
[56].
• Certain programs, such as the LHC and the NSF Division of Polar Programs, fund their
own dedicated network circuits. LHC leverages the ESnet network. Polar traffic is
limited by geographic location. Several research programs use both ESnet and IRNC
networks.
4. METHODS
The primary purpose of this study was to project the amount of data being exchanged via IRNC
networks in the year 2020. The initial study plan was to survey the IRNC network providers,
review their annual reports, examine measured network traffic over the IRNC links, and conduct
interviews with representative international science programs.
This approach proved challenging because of wide variation among IRNC providers in their degree
of participation in and knowledge of projects using their networks. Another challenge was that
most of the IRNC networks were measuring/recording only total bandwidth utilization, an
approach providing limited information to analyze. The GLORIAD network was unique in
providing more detailed network history information.
A survey with detailed questions regarding file size, time-to-transfer requirements, the type of file
systems the files are stored on, and so forth was prepared, but after conducting several in-person
interviews it became apparent that few international science projects have detailed information
about their current and future plans to produce, transport, store, analyze and share data at a level
of detail that is useful for network capacity and service planning. ESnet and the NSF Polar
Programs have addressed this challenge by organizing special meetings at which program
scientists sat down with network engineers and spent a couple of days working through these
details.
This report’s advisory committee recommended that as an alternative, domain experts be asked to
provide a description of data trends in their fields, taking into consideration the following factors
that would be likely sources for increased IRNC traffic:
• New instruments with higher data/transport requirements
• Scaling up of current activities (e.g., more people, more data or use of data)
• New areas of the world increasing their traffic through local/regional improvements
(Africa, Pacific Islands)
• New technology that reduces, by orders of magnitude, the cost of collecting/retaining data
• Programs currently funding their own communications networks (e.g., the NSF Polar
programs) that may move to the R&E networks
In addition, a survey for NSF PIs was carried out to further explore these same questions.
All surveys used are included in the appendix section, along with a list of persons interviewed and
reports referenced.
4.1. SURVEY OF NETWORK PROVIDERS
The PI or PI-designated representative for each of the IRNC production and research networks and
exchange point operators were interviewed over the period September 2013 – January 2014. The
survey used can be found in Appendix A. Questions were designed to collect data on current
capabilities, data volume, user community needs, user support approach, upgrade strategies, and
data projections. The questionnaire was also used with one regional network provider that
connects to the IRNC. A summary of survey responses is available in section 5.1.
4.2. AVAILABLE NETWORK MEASUREMENTS
The IRNC network providers were asked to provide measures of network performance for their
networks. Most IRNC networks could provide some measure of bandwidth utilization over time.
GLORIAD was the one network provider that had been maintaining records of IP flows (one flow
typically corresponds to one application) over its ten years of operation. The absence of more
detailed information about network traffic on IRNC networks was explained by providers as
resulting from (a) strict concerns among the European research community regarding privacy, (b)
challenges in developing the multiple bi-lateral policies needed to allow such measurement, and (c)
lower priority/lack of funding. A summary of available measurements is described in section 5.2.
4.3. IRNC UTILIZATION COMPARED TO ESnet AND GLOBAL INTERNET
TRAFFIC
Data representing IRNC traffic that was available for the period 2009-2013 was compared to an
analysis of traffic on the global Internet for the same period of time. The analysis is in section 5.3.
4.4. DATA TRENDS BY SCIENCE DOMAIN
Science domain experts provided written descriptions of data trends in their fields. These
contributions are available in section 5.4.
4.5. CATALOG OF INTERNATIONAL BIG DATA SCIENCE PROGRAMS
A questionnaire was developed for interviewing science communities requiring international data
exchange. The target science communities were identified by asking IRNC providers to identify
science disciplines that produced the highest data volume both now and potentially in the future.
In addition, scientists from large programs in those disciplines were asked to describe their current
data collection and storage volume and needs, the resources utilized to transmit data, technical
community interaction, and data projection strategies.
Due to high variation in quality and depth of responses, the study moved toward broader
exploration of international big science programs via Internet searches, attending a variety of
science domain community meetings, and through responses to an on-line survey of NSF Principal
Investigators. This survey (Appendix B) was designed to capture existing and planned
international science collaborations, knowledge of new instrumentation, and extent of international
collaboration within NSF funded programs.
Using publicly available information from the NSF awards site, 30,897 PIs receiving NSF funding
from FY2009 – July 2014 were invited to respond to the survey. A total of 4,050 persons
responded, a 13% response rate. At approximately the same time (summer 2013), Jim Williams at
Indiana University and Internet2 began collecting “The International Big Science List” [57].
These efforts have been much-expanded and placed within a framework for describing scientific
data. Called the “Catalog of International Big Data Science”, this interactive and updatable
website is available at http://irnc.clemson.edu.
5. FINDINGS
5.1. FINDINGS: SURVEY OF NETWORK PROVIDERS
5.1.1. Current IRNC Infrastructure and Capacity
The IRNC production networks operate using industry-standard TCP/IP networking, and IRNC
providers described their services as state-of-the-art. The experimental StarLight network supports
the use of specialized high performance protocols such as UDT [58]. The IRNC experimental
network uses optical networking technology and protocols developed by GLIF.
StarLight has more than ten years of experience with programmable networking in support of
many international Grid projects. Approximately seven years ago, StarLight began investigating
SDN/OpenFlow [59;60] technologies. Subsequently, the StarLight community worked with GENI
to design and implement a nationwide distributed environment for network researchers based
on SDN/OpenFlow, using a national mesoscale L2 network as a production facility. For over three
years, with IRNC and GENI support, and with many international partners, StarLight has
participated in the design, implementation and operation of the world's most extensive
international SDN/OpenFlow testbeds, with over 40 sites in North America, South America,
Europe and Asia. With support from GENI and the international network testbed community, a
prototype Software Defined Networking Exchange (SDX) was designed and implemented at the
StarLight Facility in November 2013 and used to demonstrate its potential to support international
science projects.
For over five years, the StarLight consortium has worked with the GLIF community and OGF to
develop and implement an NSI Connection Service. More recently, StarLight has been supporting
a project that is integrating NSI Connection Service 2.0 and SDN/OpenFlow.
All IRNC networks except TransLight/PacificWave have at least one 100Gbps network path; this
matches the transition of campus, regional and national R&E network providers to 100Gbps
external network speeds. The high cost of crossing the Pacific Ocean (a cost factor of 5 higher
than the Atlantic) presents a challenge. TransLight/Pacific Wave provides a 100Gbps path from
LA to Hawaii and 40Gbps of total bandwidth to Australia, New Zealand and elsewhere in Asia.
5.1.2. Current Top Application Drivers
Applications currently having the highest bandwidth or other demanding network requirements,
as named by IRNC providers, included:
• The Large Hadron Collider (Tier 1 transfers from CERN to Europe, the US and Australia)
• Computational genomics
• Radio telescopes
• Computational astrophysics
• Climatology
• Nanotechnology
• Fusion energy data
• Light sources (synchrotrons)
• Astronomy
5.1.3. Expected 2020 Application Drivers
Looking forward to 2020, IRNC network providers expected application drivers to remain the same
as in section 5.1.2, with the addition of:
• More visualization
• Astronomy moving away from shipping tapes/drives toward near-real-time reaction to
events, in order to verify an event and focus observation instruments
• Video, live and uncompressed
• Climate science and geology: collecting more LIDAR data as needed, combined with
other data (e.g., earthquake monitoring and response)
• Larger sensor networks, especially in portions of the globe where there is currently no
weather data being collected
• The Square Kilometer Array [61] being built in Australia and South Africa
• The catalog of life on this planet is growing larger and larger; it will be stored in many
locations. What is complicated is the coordination needed to make it a single data set; some
type of federated model is needed
• Data is currently concentrated in the US but will become more global in nature; consider
where new telescopes are located and where population density is (China, India). Data will
become global in terms of where it needs to go and where it will rest
5.1.4. Interaction of Network Operators with Researchers
Network operators interacted most frequently with other people supporting R&E networks, and
typically had infrequent interaction with scientists.
“We are often surprised and discouraged by the overall lack of interaction between
researchers/scientists and their network operators, sometimes within the same
institution.” (NSRC interview)
The AmLight and StarLight programs were exceptions to this pattern, and each reported the most
detailed knowledge of end-user applications.
5.1.5. Current Challenges
When asked to describe current challenges they face, the IRNC network providers identified the
following:
1) Lack of wide area network knowledge on campuses
IRNC providers’ experience is that most network problems reported were caused by the end
system; for example, an underpowered machine having inadequate memory, slow hard disk, or
slow network card. A misconfiguration in the campus Local Area Network was another example.
IRNC providers view the role of campus network staff as sitting at the edge of the web of regional,
national, and international network connections; they are responsible for connecting their campus
to this web. In this role, campus network staff are an essential component of end-to-end support
but, unfortunately, are frequently not knowledgeable about wide area technologies and therefore
cannot assist with end-to-end problem solving without help from the IRNC (or regional or national
network) providers. Network path troubleshooting is still a very people-intensive process. The
person who can do this needs a well-rounded skill set: he/she must understand end-user
requirements, storage, computation, and the application as well as networking. There is room for
automation here. IRNC providers would like to see more training for campus network staff in
BGP, SDN, and wide area networking.
2) Inadequate and uneven campus infrastructure
There are challenges getting the local network infrastructure (wiring/switches) ready to support an
application. Campuses have multiple and perhaps conflicting demands for investment in the
campus wiring and network electronics infrastructure. Getting wiring and electronics upgraded
all the way to a specific end user’s location in the heart of a campus may not be a high priority for
the campus.
3) Poor coordination among network providers
The regional, national, and international web of network connections is not well coordinated. As
a result, it can sometimes be difficult to identify the right person to contact during end-to-end
troubleshooting, and there may be inefficiencies in the investment. The network path crosses
multiple organizations; the hard part is figuring out which network segment has the issue, then
working with individual researchers to fix the local system or network.
4) Interoperability Challenges
IRNC providers face interoperability challenges in connecting 10Gbps and 100Gbps circuits and in
connecting optical networks with software defined networks; the different implementations of
SDN provide additional challenges.
5) Adoption of New Network Technologies
IRNC providers expect rapid growth in optical networking and SDN in the next 3-4 years. They
are concerned that science communities don’t appear to know anything about this yet. They are
also concerned about whether it will be easy enough for end-users to use.
5.1.6. What are Future Needs of International Networks?
Most IRNC providers identified science communities’ requirements as the best drivers for future
network directions. In general, scientists are always pushing the frontiers of advanced networking
and thus encounter new problems.
The network needs to be thought of as a global resource. It is important to work collaboratively
to coordinate and provide solutions that cross boundaries. Organizations that foster working
together across international boundaries are needed. The NSF Exchange Point program and GLIF
were mentioned as two examples that are working well.
West Coast providers were particularly concerned about funding.
“The US government funds only circuits into US, not the other way around. Other
countries are now paying much more than the funding provided by NSF. And the
focus on instruments of interest are shifting away from the US – eg: LHC and large
square K array. The budget should shift by a factor of 10, particularly on the west
coast where 10G across the Pacific is a factor of 5 higher cost than the Atlantic.”
5.1.7. Exchange Point Program
The International Exchange Point program is seen by IRNC network operators as a success:
“It is terrifically important and significant and successful.”
“I think it’s been very important & will be more so, later. More small countries
are coming in with their own NRENS. Geological, climate and genomics are
requiring information coming in from all over. “
“US exchange points are very important – without them, the US would not be as
much in the center of this as it is. I worry about the long-term impacts of the global
R&E community’s reaction to the allegations about NSA. There are some
communities that are spending their own money to get to US facilities, but are
talking about going elsewhere as a result of this. From a US perspective, we must
support these exchange points – they are really, really important.”
5.2. FINDINGS: NETWORK MEASUREMENT
Measuring the characteristics of network traffic is the foundation for understanding the types of
applications (eg: streaming video, large file transfer, email) on the network, frequency of these
applications, the number and location of users, network performance, etc.
Network traffic monitoring and the level of detail of any monitoring vary significantly across
IRNC network providers. Recent efforts to standardize measurement have focused on universal
installation of the perfSONAR platform [62] that can be very useful in understanding actual
network performance on each link of a network path when debugging end-to-end application
issues. However, with the exception of the GLORIAD network monitoring, long-term
performance measurement reporting is limited to bandwidth utilization.
A typical bandwidth utilization report is represented in Figure 8, showing average use of 40Gbps
links into and out of the PacificWave traffic router during Q4-2013. Blue represents incoming
traffic with an average utilization of 14.45Gbps; green represents outgoing traffic with an average
utilization of 15.32Gbps. Each line in the graph is itself an average over the portion of the week
represented. The bursty nature of network traffic is reflected in the shape of the graph; the spikes
or peaks show that on occasion traffic can be double the average, thus the “headroom” requirement.
It should be noted that the graph represents only the public portion of the PacificWave exchange
and does not represent all of the traffic over the facility. There are private connections and
CAVEwave [63] and CineGrid [64] traffic (part of the StarWave experimental network) that are
not included in these numbers.
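As a rough worked example, the arithmetic below relates the Figure 8 averages to the 40Gbps link capacity; treating peaks as roughly double the average follows the description above and is an illustration of the headroom rule, not an additional measurement.

```python
# Rough utilization arithmetic for the Figure 8 example (40 Gbps links).
# The averages come from the text above; treating peaks as roughly twice
# the average is the rule of thumb described there, not a measured value.
capacity_gbps = 40.0
avg_in_gbps, avg_out_gbps = 14.45, 15.32

for name, avg in (("inbound", avg_in_gbps), ("outbound", avg_out_gbps)):
    avg_util = avg / capacity_gbps
    est_peak = 2 * avg                      # "spikes ... can be double the average"
    peak_util = est_peak / capacity_gbps
    print(f"{name}: average {avg_util:.0%} of capacity, estimated peak ~{peak_util:.0%}")

# Average use near 36-38% leaves bursts of roughly twice the average still
# below capacity, which is the point of engineering links with "headroom".
```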
FIGURE 8. QUARTERLY PACIFIC WAVE TRAFFIC FOR OCT, NOV, AND DEC 2013

Figure 9 demonstrates the rapid rate at which scientists discover and utilize available tools. The
graph was derived from 62 inbound PacificWave quarterly graphs covering the time period
January 2006-December 2013. The dark blue horizontal lines indicate link capacity; bandwidth
was increased from 1Gbps to 10Gbps (2010) and then 40Gbps (2012). The light blue vertical bars
represent each quarter’s average throughput for that period of time, and the black ‘T’ shape
above the quarter average indicates the peak throughput recorded for that period.
FIGURE 9. EXAMPLE OF IRNC NETWORK UTILIZATION OVER 8 YEARS
5.2.1. GLORIAD’s Insight Monitoring System
The GLORIAD “Insight” system [65] provides flexible, interactive exploration and analysis for all
GLORIAD backbone traffic since GLORIAD’s beginning as “MIRnet” in 1999. Insight is open
source software developed by GLORIAD in collaboration with the China Science and Technology
Network (CSTnet) [66] and Korea's KISTI. Large IP flows are the units measured, and searchable
information includes traffic volume, source/destination country, packet loss, source/destination by
U.S. State, traffic volume by application type or scientific discipline, network VLAN or ASNUM,
or network protocol. Both live and historical data is available. Figure 10 summarizes all
GLORIAD traffic from 2009-2013 by world region; Figure 11 provides an example snapshot of
a packet loss incident. Total data stored to date comprises almost 2 billion records, with a million
new records added each day.
FIGURE 10. GLORIAD LARGE FLOW TOTAL TRAFFIC 2009-2013, BY WORLD REGION
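As a purely illustrative aside, the toy sketch below shows the kind of attribute-based aggregation ("drill-down") that a flow-record system such as Insight supports; the field names and values are invented and do not reflect the actual Insight schema or GLORIAD data.

```python
# Toy illustration of flow "drill-down": group flow records by an attribute
# and total the bytes. Field names and values are invented examples only;
# they are not the Insight schema or real GLORIAD measurements.
from collections import defaultdict

flows = [
    {"country": "KR", "application": "Aspera", "discipline": "genomics", "bytes": 8.0e12},
    {"country": "CN", "application": "HTTP",   "discipline": "physics",  "bytes": 2.5e12},
    {"country": "RU", "application": "ROOTD",  "discipline": "physics",  "bytes": 4.1e12},
    {"country": "KR", "application": "HTTPS",  "discipline": "genomics", "bytes": 1.2e12},
]

def total_bytes_by(records, attribute):
    """Sum bytes per distinct value of the given flow attribute."""
    totals = defaultdict(float)
    for rec in records:
        totals[rec[attribute]] += rec["bytes"]
    return dict(totals)

print(total_bytes_by(flows, "country"))      # e.g. drill down by source country
print(total_bytes_by(flows, "application"))  # or by application type
```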
Packet loss is a measure indicating significant network congestion and/or interruptions in network
service. Insight allows network operators to drill down into live traffic during packet loss to
problem solve. The drill-down path can follow any of the flow’s recorded attributes such as
protocol, application, institution, etc.
FIGURE 11. INSIGHT PACKET LOSS DISPLAY EXAMPLE

Insight categorizes applications using well-known communication port numbers or certain
well-known traffic behaviors. Using GLORIAD's historical network records and Insight, it was
possible to compare IP flow data for the year 2009 to the year 2013. In terms of total bytes per
year, GLORIAD recorded 0.6 PB in 2009 and 3.7 PB in 2013, a cumulative growth factor of 5.7
(Figure 12).

FIGURE 12. GROWTH IN GLORIAD TOTAL ANNUAL BANDWIDTH OVER 5 YEARS

Insight also reports the top applications in terms of total bytes per application. Comparing 2009
to 2013 (Figure 13; strike-throughs in the labels show applications present in only one of the
summaries), important changes include:

o The File Transfer Protocol (FTP) [67;68] is no longer a top 10 application; large data files are
being moved by other applications such as the Aspera high throughput file transfer software [69].

o IPV6 [70;71] is no longer being tunneled through IPV4 [72;73], so it is no longer listed as an
application; this reflects significant adoption of IPV6, which is especially important to developing
countries that arrived at the Internet after the IPV4 address space was exhausted.

o The appearance of HTTPS indicates a higher level of attention to securing communication with
encryption.
Insight also displays a map of shaded geographic regions; each flow is labeled by country or U.S.
state, and regions with more hits are shaded darker, indicating the locations sending or receiving the
greatest number of flows. For each year, the left-hand side maps source flows, while the right-hand
side maps destinations. Comparing the year 2009 to 2013 (Figure 14) shows that within the U.S.,
many EPSCoR jurisdictions [74] have fallen behind in building their R&E network capabilities,
although some, such as South Carolina and Alabama, have actually increased their use of the
international R&E networks.

FIGURE 13. CHANGE IN APPLICATIONS MOST FREQUENTLY USED ON THE GLORIAD NETWORK, COMPARING 2009
TO 2013 (applications labeled in the 2009 and 2013 panels: Other (TCP), HTTP, Other (UDP), Aspera, ROOTD,
HTTPS, NICELink, SSH, Unlabeled, COMMplex, HyPack Data Ac, Unidata LDM, FTP, Other (IPV6))

FIGURE 14. TOTAL BANDWIDTH PLOTTED BY US STATE SHOWS A DROP IN BANDWIDTH FROM EPSCOR
JURISDICTIONS, PROBABLY DUE TO LOWER INVESTMENT IN INFRASTRUCTURE UPGRADES. (2009 PANEL ON THE
LEFT, 2013 PANEL ON THE RIGHT; SOURCE LOCATION IS ON THE LEFT-HAND SIDE OF EACH PANEL, DESTINATION
LOCATION ON THE RIGHT)
5.2.2. The Angst over measurement and data sharing
At least two meetings have been held that included IRNC network providers and addressed the
topics of IRNC measurement and IRNC cybersecurity; these meetings issued reports with
recommendations, including "Security at the Cyberborder Workshop Report: Exploring the
Relationship of International Research Network Connections and CyberSecurity" [75] and
"International Research Network Connections: Usage and Value Measurement" [76]. At a January
2013 PI meeting, each IRNC project summarized existing baseline measurement capabilities and
described possible target measurement capabilities and services beyond the current IRNC program
phase. The discussion emphasized the usefulness to funding agencies, users, and resource
providers of the new Extreme Science and Engineering Discovery Environment (XSEDE) [77]
resource utilization tool XDMoD [78], especially its support for "drill down" into finer detail.
IRNC PIs mentioned a host of privacy and technical issues they consider obstacles to similar
visibility into government-funded network resources. Transparency in network use was described
as a complex partnership between NSF, grantee institutions, carriers, and international partners
(although most of these were not represented in this conversation). Existing data sharing
frameworks and display strategies could be put to work if the data were available and standardized.
Both meetings agreed that a lowest common denominator for agreeing to share data would be to
passively capture and report on DNS packets. However, these recommendations have not been
implemented yet by all participants.
5.3. IRNC NETWORK TRAFFIC COMPARED TO ESnet AND THE GLOBAL
INTERNET
How does this analysis of GLORIAD’s network traffic over the last 5 years compare to what is
known about other research networks, or even the commodity Internet traffic over the same period
of time?
5.3.1. Synopsis of data trends for the ESnet International Networks
FIGURE 15. ESNET PROJECTED LARGE-SCALE SCIENCE DATA TRANSFERS COMPARED TO HISTORICAL TRENDS
Author: Brian Tierney, Staff Scientist, ESnet Advanced Network Technologies Group, Lawrence
Berkeley National Laboratory
The DoE operates a network, ESnet, that is dedicated to DoE applications and requirements; ESnet
also operates internationally. Because the set of applications supported is much smaller than that of
the IRNC (and perhaps for other reasons), DoE has been able to conduct application requirements
workshops for its users and has published projected capacity requirements [79;80].
Based on data collected at the 2012 requirements workshops summarized in the report, ESnet
estimates that traffic to Europe will continue to grow exponentially until at least 2022 (Figure 15).
ESnet Big Data Applications
The LHC, the most well-known high-energy physics collaboration, was a driving force in the
deployment of high bandwidth connections in the research and education world. Early on, the LHC
community understood the challenges presented by their extraordinary instrument in terms of data
generation, distribution, and analysis. The instrument will be back online in early 2015, and it
is expected to produce 2-5 times more data per year after that [81;82].
In climate science, researchers must analyze observational and simulation data sets located at
facilities around the world. Climate data is expected to exceed 100 exabytes (1 exabyte = 1000
petabytes) by 2020. New detectors being deployed at X-ray synchrotrons generate data at
unprecedented resolution and refresh rates. The current generation of instruments can produce
300 or more megabytes per second and the next generation will produce data volumes many times
higher; in some cases, data rates will exceed DRAM bandwidth3, and data will be preprocessed in
real time with dedicated silicon [83]. Large-scale, data-intensive international science projects on
the drawing board include the International Thermonuclear Experimental Reactor (ITER) and the
SKA, a massive radio telescope that will generate as much or more data than the LHC. The Belle
II experiment, based in Tsukuba, Japan, is part of a broad-based search for new physics involving
over 400 physicists from 55 institutions across four continents. The first data from Belle II is
expected in 2015, and data rates between Japan and the USA are expected to be about 19 Gbps in
2018 and 25 Gbps in 2021 [84].
ESnet collects traffic utilization data on all of its international links but does not currently have an
easy way to aggregate that data into detailed overall summaries. Current
utilization data is available at http://graphite.es.net/. ESnet also runs an Arbor Networks NetFlow
analysis system that provides the ability to do some analysis. Figure 16 is an example analysis,
which seems to show ESnet’s international traffic has roughly doubled in the past three years.
(Interestingly, IRNC traffic as represented by GLORIAD experienced a four-fold increase over
the same period of time).
3 Memory (DRAM) bandwidth is the rate at which data can be read from or stored into a semiconductor
memory by a processor. Exceeding this bandwidth means data is being produced faster than it can be stored.
FIGURE 16. ESNET REPORT FOR THE PERIOD NOV 2011 - MAR 2014
Data Transfer Bottlenecks
ESnet's "Science Engagement" team spends a lot of time helping scientists improve their data
transfer capabilities. Based on their experience, performance bottlenecks usually fall into one or more
of the following categories (usually more than one):
• Hosts that are not properly tuned for high-latency networks
• Using the wrong tool (e.g., scp instead of GridFTP)
• Packet loss, which is caused by one or more of the following:
o Undetected dirty optics / bad fibers
o Underpowered firewalls
o Under-buffered switches/routers
Note that packet loss is rarely caused by networks that are oversubscribed. See ESnet's "Science
DMZ" paper for more details [85].
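As a rough illustration of the first category (host tuning), the bandwidth-delay product determines how much data must be "in flight" to keep a high-latency path full; hosts whose TCP buffers are smaller than this will never reach line rate. The sketch below uses illustrative, assumed numbers rather than measurements from any specific IRNC link.

```python
def bandwidth_delay_product_mb(bandwidth_gbps, rtt_ms):
    """Return the bandwidth-delay product in megabytes.

    A TCP connection needs socket buffers at least this large to keep
    a long, high-bandwidth path fully utilized.
    """
    bits_in_flight = bandwidth_gbps * 1e9 * (rtt_ms / 1000.0)
    return bits_in_flight / 8 / 1e6  # bits -> bytes -> MB

# Example: a 10 Gbps trans-Pacific path with ~150 ms round-trip time
# (assumed values) needs on the order of 190 MB of buffering, far above
# typical operating-system defaults of a few MB.
print(f"{bandwidth_delay_product_mb(10, 150):.0f} MB")
```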
5.3.2. Global Internet Growth Rate
Cisco, a major IP router vendor, provides current forecasts of Internet growth rates called the Cisco
Visual Networking Index (VNI) [86]. Other sources referenced in the literature [87;88] have not
been updated since 2009.
The Cisco VNI published in June 2010 predicted growth over the period 2009-2014 [89], which
spans the period of time covered in this study. Table 1 compares those predictions to the IRNC
experience as exemplified by GLORIAD data.
Overall traffic growth
- Cisco VNI 2010 prediction for Internet growth, 2009-2014: Global IP traffic will increase by a factor of more than 4.3.
- Actual global Internet growth, 2009-2013: Global IP traffic increased by a factor of 5.
- GLORIAD data, 2009-2013: GLORIAD traffic increased by a factor of 5.7.

Highest traffic-generating regions
- Cisco VNI 2010 prediction: By 2014, the highest IP-traffic generating regions will be North America, Asia Pacific, Western Europe, and Japan.
- GLORIAD data, 2009-2013: The highest IP-traffic generating regions in 2013 were Asia Pacific, North America, Western Europe, and the Russian Federation. (Note: GLORIAD is only one of 5 IRNC network providers.)

Primary growth driver
- Cisco VNI 2010 prediction: The primary growth driver will be video, exceeding 91% of the consumer market by 2014.
- Actual global Internet growth, 2009-2013: The primary bandwidth driver is video.
- GLORIAD data, 2009-2013: Data tagged as audio/video was less than .001% of total data; 99% of the data was transported using TCP. An example "Big Data" flow is Aspera, a high-speed file transfer software.

Video conferencing
- Cisco VNI 2010 prediction: Web-based video conferencing will grow 180-fold from 2009 to 2014.
- GLORIAD data, 2009-2013: Flows labeled as audio/video grew by a factor of 200.

Peer-to-peer traffic
- Cisco VNI 2010 prediction: For the first time since 2000, peer-to-peer (P2P) traffic will not be the largest Internet traffic type, being replaced by video.
- Actual global Internet growth, 2009-2013: Internet video dominates global network traffic; P2P is declining globally [90].
- GLORIAD data, 2009-2013: No data on P2P, but note the reduction in UDP traffic (Figure 13).

TABLE 1. COMPARING CISCO VNI PREDICTIONS (JUNE 2010, FOR 2009-2014) TO ACTUAL GLOBAL INTERNET
GROWTH AND ALSO TO GLORIAD HISTORICAL DATA
We can conclude from Table 1 that IRNC networks are growing at a rate similar to Cisco's
predictions regarding the commodity Internet, but with the key difference that a primary growth
driver is data. This distinction is no small matter: although GLORIAD constitutes only a tiny
portion of global Internet traffic (3.2E-06 of the total), it must be engineered for scientific
application patterns. R&E networks really do matter. Because the growth rates track one another,
industry forecasts can be useful in planning IRNC network capacity and design, with attention paid
to the unique distribution of applications.
Another important change in Internet traffic over the period 2009-2013 has been the migration of
a significant amount of network traffic from global transit providers to cloud (content) providers.
Traffic patterns between network administrative domains are evolving, with a significant shift in
Internet inter-domain traffic demands and peering policies [90]. From a hierarchical model in which
global transit providers interconnect smaller tier-2 and regional/tier-3 providers, the Internet has
evolved to one in which the majority of commercial inter-domain traffic by volume flows directly
between large content providers (e.g., Google, YouTube), data centers, and consumer networks. The
result is that the majority of traffic bypasses long-haul links, with content delivery and cloud service
providers connecting directly to metro and regional backbones.
This change in traffic patterns does not appear within R&E networks, probably due to slower
adoption of both commercial and private cloud technologies in higher education research.
5.3.3. Industry Internet Growth Forecast for 2013-2018
Having demonstrated that industry forecasts are useful to the IRNC, it is valuable to reflect on key
findings in the most recent predictive report (Cisco VNI). A graph of the predicted growth trend
(Figure 17) shows growth to an expected global IP traffic of 1.6 zettabytes (one zettabyte is one
trillion gigabytes) per year by 2018. To give a sense of the size of this number, this is equivalent to
all movies ever made crossing the global Internet every 3 minutes.
Key predictions in the Cisco report include:
• The Internet grew fivefold from 2009 to 2014, largely due to the growth in the number and
use of mobile devices such as smart phones and tablets. In the next three years (through 2018),
traffic will grow at a threefold pace.
• Busy-hour Internet traffic is growing more rapidly than average Internet traffic (growing at a
rate of 3.4x).
• Content delivery networks will carry more than half of Internet traffic by 2018.
• Traffic from wireless and mobile devices will exceed traffic from wired devices by 2018.
• "The Internet of Things": the number of devices connected to IP networks will be nearly
twice as high as the global population by 2018.
o Devices and connections are growing faster than both the world population and the
number of Internet users.
o This trend comes from increasing numbers of devices per person/household, and also
a growing number of machine-to-machine (M2M) applications.
o There will be nearly one M2M connection for each member of the global population
by 2018.
• Broadband speeds will nearly triple by 2018 (to 42 Mbps).
• Globally, IP video traffic will be 79 percent of all IP traffic, for both business and consumer;
all forms of video will be in the range of 80-90 percent of global consumer traffic by 2018.

FIGURE 17. GROWTH OF INTERNET IP TRAFFIC: ESTIMATED GROWTH IS FROM 0.75 ZETTABYTES TODAY TO 1.6
ZETTABYTES PER YEAR BY 2018
• Globally, mobile traffic will increase 11-fold between 2013 and 2018, reaching 191 exabytes
annually by 2018.
• By 2018, about half of all fixed and mobile devices and connections will be IPv6 capable.
• IP traffic is growing fastest (compound annual growth rate of 38%) in the Middle East and
Africa, followed by Asia Pacific. By 2018, growth will reach 131.5 exabytes per month of
additional traffic, distributed across regions as shown in Figure 18.
FIGURE 18. REGIONAL DISTRIBUTION OF NEW TRAFFIC IN 2018 (regions shown: Asia Pacific, North America,
Western Europe, Central and Eastern Europe, Latin America, Middle East and Africa)
5.4. FINDINGS: DATA TRENDS
Science domain experts were invited to describe data trends in their respective fields, including
describing the types of data, growth and impact of international collaborations, and factors that
might impact the amount of data produced by 2020, including new instrumentation, technology,
or methods of analysis. Physics and climate science were described previously in the ESnet Big
Data Applications discussion (Section 5.3.1).
5.4.1. Synopsis of Data Trends in Astronomy and Astrophysics
Author: M. L. Norman, Professor of Physics, University of California San Diego and Director,
San Diego Supercomputer Center
Types of Data
Four international astronomy projects will drive international R&E network traffic in the early
2020’s: LSST, ALMA, SKA, and JWST. The Large Synoptic Survey Telescope (LSST) will
generate large (3.2 Gpixel) image files of the sky in six color bands, as well as a large object
catalog. The Atacama Large Millimeter Array (ALMA) and the Square Kilometer Array (SKA)
will produce raw radio interferometry data which consists of UV-plane visibilities and
autocorrelations for many frequencies. The James Webb Space Telescope (JWST) will generate
multicolor images and spectra of individual objects in the infrared and submm parts of the
spectrum. Other types of observational astronomy data from smaller projects includes images,
spectra, spectral data cubes (2 space, 1 frequency), and time series data. Virtually all
observational astronomy data is stored in Flexible Image Transport System (FITS) [91] format
files, which is a self-describing portable data format.
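Because virtually all observational data is exchanged as FITS files, a minimal example of inspecting one is sketched below. It uses the widely available astropy library; the file name is a placeholder, and the header keywords shown are common FITS conventions rather than anything specific to the projects above.

```python
from astropy.io import fits

# Open a (hypothetical) FITS image file; FITS is self-describing, so the
# header carries the metadata needed to interpret the data array.
with fits.open("example_image.fits") as hdul:
    hdul.info()                       # list the header/data units (HDUs)
    header = hdul[0].header
    data = hdul[0].data               # e.g. a 2-D image as a NumPy array
    print(header.get("TELESCOP"), header.get("DATE-OBS"))
    if data is not None:
        print("image shape:", data.shape)
```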
FIGURE 19. ESTIMATED ANNUAL DATA PRODUCTION FROM BIG SCIENCE PROJECTS BY 2020. SIZE OF DOT
REPRESENTS PB OF DATA PRODUCED AT THAT SITE AND NEEDING TO BE MOVED ACROSS INTERNATIONAL
BOUNDARIES. RED ARE INSTRUMENTS; ORANGE ARE DATA REPOSITORIES; GREEN IS DATA CONTROL
TRANSMISSION; BLUE IS OTHER DATA.
Astrophysical and cosmological simulations produce field and particle data in 1, 2, or 3
dimensions. Data from such simulations is output as multivariate array data and particle lists that
record spatial coordinates, velocity components, mass, and other particle attributes. Many
simulations have adopted the portable HDF file format [92], which is a self-describing
hierarchical data format, but others output application-specific binary files.
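For simulation output stored in the portable HDF format, the hierarchical structure can be walked generically; a short sketch using the h5py library follows. The file name and dataset paths are hypothetical placeholders standing in for whatever a given simulation code writes.

```python
import h5py

def describe(name, obj):
    """Report every dataset found while walking the file hierarchy."""
    if isinstance(obj, h5py.Dataset):
        print(f"{name}: shape={obj.shape}, dtype={obj.dtype}")

# Open a (hypothetical) HDF5 simulation snapshot read-only.
with h5py.File("simulation_snapshot.h5", "r") as f:
    f.visititems(describe)           # recurse through groups and datasets
    # A slice of a particle list could be read without loading the whole
    # array, e.g.: positions = f["particles/position"][:1000]
```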
Trends from simulation to observation, or combinations of these
Simulation has two meanings in this context: (1) simulated observations, used to mock up
observational data for the purposes of developing/assessing automated analysis pipelines; and (2)
astrophysical simulations. The above mentioned astronomy projects make extensive use of
simulations of the first kind. In the run-up to full science operations, large amounts of simulated
observations are generated—comparable in size to actual data—to test the readiness of the
distributed data processing infrastructure.
Increasingly, large cosmological simulations are being performed for the purpose of
designing/optimizing observing strategies for measuring dark energy or probing the epoch of
reionization.
Trends in international collaboration and/or international data sharing
All the projects mentioned above are international due to their high cost and one-of-a-kind
nature. The main international partners for LSST are Chile, France, and the Czech Republic. The
main international partners for ALMA are Chile, European Southern Observatory, and Japan.
Main international partners for SKA are Australia, Canada, China, Germany, Italy, New Zealand,
South Africa, Sweden, the Netherlands and the United Kingdom. The US is not participating at
the present time. The main international partners for JWST are the European Space Agency and
Canada. All these projects involve international data grids of some sort where science data is
replicated in archives operated by the member nations.
The primary LSST archive will be at NCSA, University of Illinois Urbana-Champaign and will
be replicated in Chile. NCSA and Chile will also host DACs (science Data Access Centers). The
primary ALMA data archive will be located at the National Radio Astronomy Observatory
(NRAO) [93] in Charlottesville, VA, with satellite archives operated in France and Japan. The
SKA instruments will be located in Australia and South Africa. Data from Australia will travel
from Perth to London via the USA. Data from South Africa will travel directly to London, where
the European distribution point is located. An LHC-style tiered data grid is envisioned to distribute
the data from London to national Tier 1 data centers. Since Canada is a partner, it is expected it
will host such a Tier 1 center, which creates the possibility that SKA data will flow to Canada
via the USA.
Numerical simulators are emulating their observational counterparts by deploying online
“numerical observatories”, which are Structured Query Language (SQL) [94] and No-SQL [95]
databases which serve up raw and derived data products over the Internet. Three conspicuous
examples are the Millennium Simulation database [96], the Bolshoi Simulation MultiDark
database [97], and the Dark Sky Simulations database [98].
New instruments, methods or use modalities that are scaling up current activities
LSST has a 3.2 Gpixel CCD array camera that will photograph the night sky every 10 seconds,
producing a dataflow of 15 TB of image data per night. With an expected 10-year project duration,
some 200 PB of data will be amassed by the end of the project. ALMA is a new large radio
interferometer that is in operation now, and will continue through the next decade. ALMA data
link speeds from Santiago, Chile to Charlottesville, VA will increase from ~100 Mb/s now to ~1
Gb/s by 2016, and then to 2.5 Gb/s by 2020 as more data-intensive observations are planned.
SKA will produce an unprecedented data rate of ~15 Pb/s aggregate between the receivers and
the central correlator. However, none of this will touch international research networks.
Processed science data is estimated to flow into the European distribution point at a sustained
rate of ~100 Gb/s and thence onto the national Tier 1 data centers at 30 Gb/s each. No estimates
are available for JWST data, but since this is a space observatory the primary telemetry data does
not enter the US on the wired Internet. Heavy international access is anticipated to the MAST
archive located at the Space Telescope Science Institute in Baltimore, MD.
New technology that substantially reduces cost of collecting/retaining data.
The National Center for Supercomputing Applications (NCSA) [99] has recently deployed the
largest High Performance Storage System (HPSS) [100] tape archive for open science data.
HPSS has added capability for Redundant Arrays of Independent Tapes (RAIT)—tape
technology similar to RAID [101] for disk. Through generated parity blocks, RAIT dramatically
reduces the total cost of ownership and the energy used to store data, without exposure to single or
dual points of failure. It also enhances the performance of data storage and retrieval
since the data is stored and read/written in parallel.
Important bottlenecks in creating, storing, or making better use of data
The bottlenecks are myriad. Aside from the Blue Waters project [102] at NCSA, the NSF has not
funded an XSEDE-wide High Performance Computing (HPC) data archive for data preservation.
As a consequence, archives at many XSEDE Service Provider sites have aged out and have been
decommissioned. Without a means to preserve data, there is no reason to create data access/data
discovery facilities around the data. HPC data largely exists as single copy data resident on the
Lustre file systems [103] of geographically-distributed HPC systems. Such data is not
discoverable or web-accessible. Researchers are turning to cloud providers to address this
problem [98], although it is unclear who will pay the costs for long-term data storage and
retrieval.
5.4.2. Data Trends in Bioinformatics and Genomics
Author: Stephen A. Goff, The iPlant Collaborative [104], BIO5 Institute, University of Arizona,
Tucson AZ
Genes, genomes, and traits have been studied and analyzed for over a century, beginning well
before biologists agreed that DNA was the genetic material. Recent advancements in automated
sequencing, high-throughput phenotyping, and molecular phenotyping technologies are rapidly
bringing biology into the “big data” era. DNA Sequencing technology is the prime example of
this advancing technology and is the gold standard of gene and genome identification and analysis.
Types of data
The types of data important for biologists include DNA sequence data, RNA sequence data,
phenotypic or trait data, environmental data, ecological data, and phylogenetic data.
New instruments, methods or use modalities that are scaling up current activities and new
technology that substantially reduces cost of collecting/retaining data.
DNA sequencing technology has decreased in cost by approximately five orders of magnitude, and
consequently DNA sequence data is currently growing faster than exponentially. Several
thousand species have been sequenced (out of a few million species total) and the technology is
now being applied to varieties of crops to empower molecular breeding. Rice, as an example, has
over 200,000 known varieties, and a few thousand rice lines have been sequenced and made
publicly available. The raw data from these few thousand varieties requires approximately 20
terabytes of storage space and a few tens of terabytes of scratch disk space for analysis
purposes. It is expected that hundreds to several thousand varieties of important crop plants will
be sequenced in the near future and this will generate petabyte levels of raw data to be stored and
analyzed. In addition to crop genomes and varieties, a few thousand plants from diverse groups of
species have been or are being sequenced. These represent a small percentage of the estimated
500,000 green plant species, but cover a broad range of diversity. Likewise for animals, a broad
range of diverse species has been or will be sequenced in the near future.
Advanced and inexpensive DNA sequencing technology is also being used to study gene
expression under different conditions. Instead of sequencing the genomes, RNA from expressed
genes (both protein-coding and non-coding) is purified and converted to DNA then sequenced.
This allows researchers to determine which genes are on in specific environmental conditions and
how an organism responds to a changing environment by changes in gene expression. This
technology is known as “RNA-Seq” and is the technology generating the highest amount of raw
data with modern sequencers. A typical, well-designed RNA-Seq experiment will generate 1-2
terabytes of data, and with the current estimates of new instruments in use today, hundreds to
thousands of terabytes of raw data could be generated daily. RNA-Seq is an example of
high-throughput “phenotyping”, where the phenotype under study is gene expression in a specific
condition. This technology is mainly applied in the developed countries, but will soon be applied
more globally as new sequencing instruments are delivered to dispersed locations. One major trend
is for sequencing to be done locally at a large number of research institutes, versus primarily at a
few large sequencing “centers”.
In addition to genome sequencing and gene expression profiling, high-throughput sequencing is
also being used to determine the offspring “genotype” created by breeding specific parental lines
of crops or livestock. This use is designated “genotype by sequencing” and is a technology poised
to create a large amount of data (petabyte levels, especially in the private sector) as molecular
breeding technology gains momentum. In the public sector, this data is likely to be generated
mainly by academic institutions focused on agriculture and livestock improvement as well as the
USDA. It is beginning to be supported by humanitarian foundations interested in improving crops
in developing countries where conventional plant and livestock breeding has not progressed as it
has in the western world.
In addition to RNA-Seq, there are several high-throughput phenotyping technologies that generate
significant datasets but are only beginning to be adopted broadly. Perhaps the most obvious is in
the field of imaging. Images from satellites are being used to detect pathogen spread through fields
on a large scale. Images are also being used to study the growth, development and health of crops
in specific fields. Microscopic images are being used to study gene expression over time in live
cells, and responses to environmental changes. Use of image technology in biological research is
expected to create petabyte levels of raw data. Medical images are one of the largest single datasets
at approximately 100 petabytes.
Laser images are being used to create three-dimensional patterns of growth in field crops (LIDAR
for example) and a single pass over a typical sized breeding station plot (10 hectares) is estimated
to generate 3-4 terabytes of raw data. There are thousands of such breeding station plots nationally,
and to get an accurate assessment of growth over time, daily measurements would likely be taken.
This data would need to move from dispersed field stations to a central analysis repository and
would require reasonably broad bandwidth and compute for complete analysis.
Another approach to phenotyping is “molecular phenotyping”. Each cell has several thousand
small molecule “metabolites” that are constantly in a state of flux. These small molecules provide
an indication of the health of the cell, the growth state, and the response to environmental
perturbations. Analyzing these small molecules is designated “metabolite profiling” or
“metabolomics” and consists of running several chemical extracts through mass-spectrometers
followed by computational analysis of the resulting spectral patterns. The data created by metabolite
profiling is currently fairly small (gigabytes per run), but has the potential of becoming quite large
as the technology matures and is applied to much larger target species and conditions. It’s difficult
to estimate the ultimate data size accurately at this point.
Trends from simulation to observation
Biology is moving from observation toward simulation and modeling. Biological organisms are
highly adapted, influenced by and tuned to environmental variables. In other words, specific
organisms are very dynamic and to understand biological mechanisms it is necessary to consider
and integrate environmental variables. The implication is that much more data will need to be
collected over time and analyzed to fully understand and accurately simulate how organisms
interact with the environment.
Trends in international collaboration and international data sharing
Larger and larger teams that cross international borders are doing biological science research.
These collaborations share genotypic and phenotypic datasets that are growing increasingly large.
Sharing terabyte levels of data is much more common today. Likewise, funding agencies are
strongly encouraging the sharing of raw biological research data across international groups.
Bottlenecks in creating, storing, or making better use of data
The main bottleneck in data creation, storage and usage is sharing large datasets generated from
dispersed research institutions, storing the data in consistent formats and annotating the data with
standard metadata descriptions to make it the most useful for researchers who were not involved
in the design of the experiments or the creation of the raw data.
5.4.3. Data Trends in Earth, Ocean, and Space Sciences
Author: Sara Graves, Director, Information Technology and Systems Center and Professor of
Computer Science, University of Alabama in Huntsville
The emerging "data rich" era in scientific research presents new challenges and opportunities for
research. “Big data” can be described with four properties: volume (huge amounts of data that are
constantly growing), variety (many types of data), velocity (ability to be available and used
quickly), and veracity (trusting data for decision-making). Data is fundamental to research and
technologies to discover, access, integrate, manage, transform, analyze, visualize, store, share and
curate data are core capabilities needed to exploit the wealth of available data. With a wealth of
data, and new tools and technologies, scientists can explore new ways to gain new knowledge and
solve problems that were previously not possible.
Types of Data
Geoscience researchers are working with a variety of globally distributed data, including field
observations and sensor-based terrestrial and/or airborne observations, simulation results and
experiment data. Many researchers work with derived data products, instead of primary data, for
combining with data in their own analyses. As science becomes more interdisciplinary, scientists
need data across multiple fields that are traditionally separate in the geosciences. In addition to
traditional scientific data, scientists require new analytics tools to process, analyze, and derive
knowledge from various structured and unstructured data, including text, audio/voice, imagery,
video, and multimedia. Access to real-time data is becoming more important as computing
capabilities improve and data becomes available. In addition to acquiring data in real-time, data
can be transformed and delivered to a user based on the user’s needs, dynamically creating new,
virtual data sets. Real-time data processing can support decision-making in real-time, presenting
new opportunities for a variety of applications with critical real-time requirements. Mobile
computing is also growing, with scientists now taking advantage of having the capability to acquire
or input data in the field.
Trends from simulation to observation, or combinations of these
While there are multiple communities in the geosciences, each with different research needs and
styles, there are fundamental overarching, crosscutting needs that are common to each. EarthCube
[105], an NSF program, is working towards the goal of transforming the conduct of research in
geosciences and encouraging the community-guided cyberinfrastructure development of a
coherent framework to integrate and use data and information for knowledge management across
the entire research enterprise. NSF is engaging the geoscience community to develop a governance
process to establish an organization and management structure to guide priorities and system
design, and groups were established to address identified areas of need.
Within EarthCube, the Data Discovery, Mining and Access (DDMA) group [106] was formed to
address the data needs for geoscience research. The DDMA group conducted a series of virtual
meetings with a diverse spectrum of the research community to exchange ideas, experiences and
knowledge and to coordinate analysis and development of a roadmap that addresses data
challenges for the geosciences community. The DDMA roadmap addresses current issues and
needs concerning data access, discovery and mining for the geosciences.
NASA’s Earth Science Data Systems Working Group (ESDSWG) [107] is focused on developing
ideas from community inputs for working with NASA Earth science data. Currently, ESDSWG is
looking at a wide range of data issues, including cloud computing, collaborative environment, open
source software, visualization, HDF5 conventions, preservation and provenance.
Trends in international collaboration and/or international data sharing
There are an increasing number of international collaborations concentrating on exploiting the use
of science data. The International Council for Science: Committee on Data for Science and
Technology (CODATA) [108] is an international organization aimed at strengthening science,
with a particular emphasis on data management. Through working groups, conferences,
workshops and publications, CODATA brings together scientists and engineers to address many
of the issues related to data quality, accessibility, acquisition, management and analysis for a
variety of scientific disciplines.
The American Geophysical Union (AGU) [109], promoting discovery and collaboration in Earth
and space science, brings together scientists from over 140 countries. With over 60,000 members,
AGU hosts meetings in the spring and fall and publishes journals, books and articles to disseminate
scientific information. The AGU Earth and Space Science Informatics (ESSI) [110] focus group
addresses many issues related to data and information. Across the geosciences, common data
standards often mentioned at AGU meetings include Open Geospatial Consortium (OGC)
Standards [111], ESIP Federation Open Search [112], Open-source Project for a Network Data
Access Protocol (OPeNDAP) [113], Metadata (FGDC, ISO 19115 [114]), File Formats (HDF,
NetCDF [115]). The Federation of Earth Science Information Partners (ESIP) [116] is a
distributed community focused on Earth and environmental science data and information.
Through a networked community of stakeholders, the ESIP Federation drives innovation by
promoting collaboration to investigate issues concerning Earth science data interoperability.
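As a concrete illustration of why protocols such as OPeNDAP matter for data movement, the netCDF4 Python library (when built with OPeNDAP support) can open a remote dataset by URL and transfer only the requested subset rather than the whole file. The URL and variable name below are placeholders, not a real service endpoint.

```python
from netCDF4 import Dataset

# Open a remote dataset via OPeNDAP (hypothetical URL); at this point only
# the metadata crosses the network, not the full data volume.
url = "http://example.org/opendap/sample_dataset.nc"
ds = Dataset(url)

print(ds.variables.keys())   # discover what the dataset contains
# Requesting a small space/time subset moves only those values:
# subset = ds.variables["temperature"][0, 100:200, 100:200]
ds.close()
```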
The Integrated Research on Disaster Risk (IRDR) program [117] is a multinational research effort
investigating the challenges, preparedness, risk reduction and mitigation of natural disasters.
IRDR is taking an interdisciplinary approach to apply science to identify risks of disasters,
facilitate decision-making, and reduce risks from disasters across the world.
New technology that substantially reduces cost of collecting/retaining data
Cloud Computing can provide computing resources for research to those that may not have access
to computing resources at their facility. In addition to cost savings and increased IT agility, cloud
computing is currently being used to provide on-demand access for scientific research and also
helps to address issues of sharing, scalability and management. Likewise, the cost of large, spacebased and airborne science platforms to acquire data can be prohibitive. Efforts are underway to
provide smaller, less-expensive platforms to introduce more opportunities for research.
Difficulties due to insufficient network infrastructure also remain problematic.
While Open Data is desired for research both nationally and internationally, policies, rules and
regulations can inhibit the production and use of Open Data. Charging for data, copyrights and
patents forbidding re-use and proprietary technologies also hinder the use of Open Data.
New instruments, methods or use modalities that are scaling up current activities
The explosion of big data and advances in data technologies are introducing new and exciting
opportunities in interdisciplinary research. Researchers are combining science data with data from
social sciences, humanities, health and other fields to create new knowledge and advance discovery
in ways previously unimagined. Advances in semantics-based technologies are needed to address
the challenges associated with increasingly complex and distributed heterogeneous data. In
addition to traditional science data, researchers are now also working with image, geospatial, text and
multimedia data. Data provenance is increasingly important as the complexity of data increases.
Metadata is essential to the discovery and preservation of data, and research in metadata can help
to address the challenges associated with data discovery, access and use. Scalable and
interoperable annotation, query and analysis can serve to exploit data in new ways. The interactive
exploration and visualization of large data sets with access to data is needed to accelerate
development of new techniques of working with data and advance scientific discovery.
Enabling effective collaborative research is also being explored. Collaborations can scale from
individuals sharing science resources, to sharing within groups such as science mission teams, to
sharing with an entire science community. New collaboration tools advance science by supporting
sharing of data, tools and results, enhancing productivity and enabling new areas of research.
Collaboration tools are needed to facilitate data search, access, analysis and visualization, as well
as to assist researchers with developing and sharing workflows and publishing results. Together, these
tools improve research capabilities and increase opportunities for the discovery of new knowledge.
Bottlenecks in creating, storing, or making better use of data
The ability to integrate, understand and manage data is a primary objective. While there are
numerous tools available, tools need to fit the way scientists perform their research. The scientist
may need to discover new datasets that they may not have considered before. Gathering relevant
data and information for case studies and climatology analysis is tedious and time consuming. The
need exists for tools to filter through large volumes of online content and gather relevant
information based on a user’s science needs. New methods of search include content-based
discovery of disparate and globally distributed datasets. For instance, the design of current Earth
Science data systems assumes researchers access data primarily by instrument or geophysical
parameter. But case study analysis and climatology studies commonly used in Atmospheric
Science research are instances where researchers are studying a significant event. Data organized
around an event rather than the observing instruments can provide results more targeted to the
scientist’s research focus.
Usability is an issue frequently encountered by scientists. To be effective, tools need to be easy to
use and efficient. It is especially important to “lower the barrier to entry” for users when using a
tool for the first time. In addition to lowering the barrier to entry, usability improvements to tools
can increase productivity and allow for new, interdisciplinary science.
Much of the scientist’s time is spent on data discovery, acquisition and preparation. While new
tools and techniques are needed for working with data, the use of specialized tools and techniques
is often outside the experience of scientists and can hinder the effective use of data. Easy-to-use,
sustainable and reusable tools can enable the scientist to overcome the complex challenges
associated with diverse, distributed and heterogeneous data and spend more time focusing on their
research. Validation of simulation data with observation data and in situ data is also needed.
Standards and interoperability play a fundamental role in working with data. The seamless
interplay of data and tools can simplify work for the scientist and increase productivity. Data
formats and communication protocols enabling the discovery, exchange and analysis of data are
essential, but determining which formats should be standard continues to be a challenge. The user
interface also continues to be an important area of research. For instance, a workbench environment
from which a scientist could find a data set, subset it to get just the region of interest, use analysis
tools on the data subset, and visualize and share the results could provide increased capabilities for
research.
With the increasing volume of data, I/O bottlenecks and scalability continue to be troublesome.
Computing has been improved with multi-core processors, but access to attached data storage can
be slow, and networks can introduce latency to data access times. Software such as Apache
Hadoop, with modules for scheduling, parallel processing and distributed file systems, offers
promise for complex, large-scale research applications dealing with resource constraints.
New drivers to international data sharing in a bottom-up way
Author: Beth Plale, Professor of Informatics and Computing and Director, Data to Insight Center,
Indiana University
The Global Lake Ecological Observatory Network (GLEON) [118] is an example of an emerging
source of international traffic. Ecologists who received tenure conducting studies and data
gathering about a single lake or two formed an organization around the thesis that there was
research benefit to collaboration among people studying lakes. As the 10 years of GLEON
progressed, it became wildly successful in one important way; it created a generation of ecologists
who received PhDs on cross-lake research, that is, research questions that involve the study of
multiple lakes in diverse ecological settings, often across the world.
The GLEON organization/community facilitated reduction of the social barriers to data sharing
that previously hindered the larger-scale research.
GLEON is a leading exemplar of larger science emerging from the organization and community
building going on amongst smaller science researchers. Taken as a whole, this is a trend that will
continue. The Research Data Alliance (RDA) [119;120] is seeing interest from these smaller-scale
science groups as a venue for international community building and for progress on technical
issues in data interoperability.
New drivers to international data sharing are also occurring in a top-down way:
• CoopEUS [121] is an EU/US jointly funded effort to facilitate international cross-disciplinary
interoperability and data exchange. Lindsay Powers at NEON Inc. is the CoopEUS Scientist
and represents EarthCube to the leadership council.
• The Research Data Alliance (RDA) builds the social and technical bridges that enable open
sharing of data. Beth Plale is an RDA Steering Group Member.
• The Group on Earth Observations Global Earth Observation System of Systems
(GEO/GEOSS) [122-124] is a ‘system of systems’ that will proactively link together
existing and planned observing systems around the world and support the development of
new systems where gaps currently exist. It will promote common technical standards so
that data from the thousands of different instruments can be combined into coherent data
sets.
• The International Council for Science (ICSU) World Data System [125] promotes
universal and equitable access to, and long-term stewardship of, quality-assured scientific
data and data services, products, and information covering a broad range of disciplines
from the natural and social sciences, and humanities.
• The Data Enabled Life Sciences Alliance (DELSA) [126] provides a leading voice and
coordinating framework for collective innovation in data-enabled science for the life
sciences community, and facilitates information exchange through workshop development,
community involvement and publications.
5.4.4. Data Trends in Computer Science
Author: Geoffrey Fox, Professor of Informatics and Computing and of Physics, Indiana University
Types of Application and Data
Computer Science covers a broad range of applications – some in domains outside computer
science but often performed in computer science. One example of this is “Network Science”
(sometimes called Web Science, Complex Systems or just Social networking) where researchers
largely from social science, computer science and physics work. Further, much work is
interdisciplinary with computer science and application science researchers working together.
For example, computer vision researchers may work with earth scientists on analysis of satellite
data, with medical scientists on pathology, and with the CS artificial intelligence community on
interpreting images for robots and, in particular, self-driving (autonomous) vehicles. This
exemplifies the field of cyberphysical systems, which is growing in importance and is a major driver
of network traffic. Much current computer science research is collaborative with industry in
areas critical to today’s digital world, including search, recommender engines, deep
learning (artificial intelligence), robotics, human-centered interfaces, and social networking. In
addition, today’s cloud and enterprise data centers exhibit many distributed systems and HPC
research discoveries, all connected by networking that exploits new research
technologies like OpenFlow and protected by cybersecurity research. There is also substantial
CS research on military applications that builds on the growing use of UAVs and other sensors and
their analysis for command and control functions. Analysis of social media data is also of great
interest to the intelligence community and is another example of CS-defense interaction.
The size of commercial data is known, with some useful numbers being: ~6 zettabytes of shared
digital data today; 1.8 billion photos uploaded every day to social media sites like Facebook
[127], Instagram [128], WhatsApp [129] and Snapchat [130] (Flickr [131] is negligible); 24
billion devices on the Internet by 2020; annual IP traffic passing the one zettabyte/year mark in
2016 (18% business, the rest consumer); and YouTube [132] and Netflix [133] together accounting
for around 50% of network traffic in the USA. The total data is growing roughly like Moore’s law
[134], but subcategories like the “Internet of Things” are growing much faster than that.
Details of Applications
More detail can be found in 51 use cases gathered by the National Institute of Standards and
Technology (NIST) [135] at “Big Data Use Cases V1.0 Submission” [136] and “Draft NIST Big
Data Interoperability Framework: Volume 3, Use Cases and General Requirements” [137].
These cover industry, government and academic applications and show many computer science
areas, presented in more detail than above. There are also many science domain
applications in the NIST report, which are covered in other parts of this report.
Direct computer science research is covered in NIST use cases 6-8 (commercial search and
recommender systems), 13-15 (military), 16-25 (healthcare), 26-31 (Deep learning and Social
media), 32 (metadata technology), 10 and 49-51 (Internet of things).
The NIST report gives key characteristics of data, with most current numbers smaller than a
petabyte for research use. The commercial cloud sites total 6 zettabytes, but other interesting
large cases include medical diagnostic imagery (70 petabytes in 2011, exceeding an exabyte if
cardiology is included) and military surveillance sensors gathering petabytes of data in a few hours.
The NIST benchmark repository (use case 31) for data analytics is currently 30 TB but expected
to increase. There are many Twitter [138] research projects, as it is possible to get good tweet
samples (through Gnip [139]) whereas other commercial sites are less forthcoming. Use case 28
estimates that it collects ~100 million messages per day (~500 GB of data/day), increasing over time,
for social media studies. This illustrates well the importance of streaming data. The content
comes from people (as in search and social media) and devices like the sensors in “smart” grids,
vehicles, rivers, cities and homes. The development of wearables (Google Glass [140] and Apple
Watch [140] are visible examples) that supplement smart phones, with even the current Samsung
Galaxy S5 [141] having 10 distinct sensors, will increase interest in research into the Internet of
Things. Streaming applications like these are present in 80% of the NIST use cases and are
clearly changing system software (Apache Storm [142] is a stalwart, but Google MillWheel [143]
and Amazon Kinesis [144] are recent high-profile announcements), network/computer use, and are
spurring a re-examination of algorithms.
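To put the use case 28 numbers in perspective, a quick back-of-the-envelope calculation using only the figures quoted above gives the average message rate and size a collection system must sustain.

```python
messages_per_day = 100e6   # ~100 million messages per day (use case 28)
bytes_per_day = 500e9      # ~500 GB of data per day

seconds_per_day = 24 * 3600
avg_rate = messages_per_day / seconds_per_day   # ~1,160 messages/second
avg_size = bytes_per_day / messages_per_day     # ~5 KB per message

print(f"average ingest rate: {avg_rate:,.0f} messages/s")
print(f"average message size: {avg_size/1e3:.1f} KB")
```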
The scale of commercial problems is spurring a new type of research, which sometimes gets bad
press. With A/B testing [145], new algorithms are tested by switching a tiny fraction of, say, Netflix
users to new software or a new algorithm; the result can then be quickly compared with the rest of
the world running the standard release. Successful trials are then incorporated in the base release,
and so the so-called perpetual beta of cloud application software continues.
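A minimal sketch of the statistics behind such an A/B comparison is shown below: two user groups, one on the standard release and one on the trial algorithm, compared on a simple success metric. The sample sizes and counts are invented for illustration and do not describe any real deployment.

```python
from math import sqrt

def two_proportion_z(success_a, n_a, success_b, n_b):
    """z-statistic for comparing success rates of control (A) and trial (B)."""
    p_a, p_b = success_a / n_a, success_b / n_b
    p_pool = (success_a + success_b) / (n_a + n_b)
    se = sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    return (p_b - p_a) / se

# Hypothetical numbers: 1% of users diverted to the new algorithm.
z = two_proportion_z(success_a=52_000, n_a=1_000_000,   # standard release
                     success_b=600,    n_b=10_000)      # trial group
print(f"z = {z:.2f}  (|z| > 1.96 suggests a real difference at ~95% confidence)")
```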
5.5. FINDINGS: ONLINE CATALOG OF INTERNATIONAL BIG DATA SCIENCE
PROGRAMS
A survey of international big science programs was conducted; the resulting “Catalog of
International Big Data Science Programs” is available as an interactive web site at
http://irnc.clemson.edu. The site aggregates information gathered from interviews with network
providers, domain scientists, the NSF PI survey, web site searches, and the I2 Big Science List.
The web site provides information regarding over 100 international science collaborations
involving Big Data and large numbers of scientists (50 or more), through visualization of locations
and data size and filtering by discipline or location. Below, a few programs are described to
help paint a picture of the special application requirements in scientific research that drive the
need for R&E networks.
5.5.1. Large Hadron Collider: 15 PB/year (CERN)
The High Energy Physics (HEP) community has largely driven network backbone capacity to
current levels of 40-100G+ (TB in Europe). HEP is the set of scientists who design, execute,
and analyze experiments requiring use of the LHC facility at CERN. There is only one such
place in the world, so those who want to participate in the experiments and results have, over 20+
years, formed an international, well-organized community around these experiments. The
computational people within the HEP community are computer scientists, programming
physicists, experimental equipment designers/operators, and so forth. This community meets
regularly at the International Conference on Computing in High Energy and Nuclear Physics
(CHEP) [146]. The community shares the LHC instrument, located near Geneva and spanning
the Franco-Swiss border, about 100 meters underground. It is a particle accelerator used by
physicists to study the smallest known particles – the fundamental building blocks of all things.
Major experimental groups are the ATLAS Experiment [147], the Compact Muon Solenoid
(CMS) [148], and A Large Ion Collider Experiment (ALICE) [149].
The instrument is closed and will re-open in 2015.
The community is a collaboration of 10,000 scientists and engineers from over 100 countries,
organized around the six types of experiments at the LHC. Each experiment is distinct,
characterized by its unique particle detector.
Because the LHC is a scarce resource, experiments are scheduled years in advance (way past
2020) and the community already knows how much data will be produced and distributed as a
result of those experiments. The community has a well-established Tiered Model for data
distribution, currently more of an inherited situation than a technical necessity. Up until
recently, CalTech was responsible for planning and implementing network connections for HEP.
CERN (Tier 0) transfers data to “Tier 1” facilities (Brookhaven and Fermi National Labs); from
there, data is re-distributed to “Tier 2” facilities, located at universities. The community is
moving unevenly to more of a mesh; sites don’t need to specialize in one kind of data/one kind
of analysis. A MonaLisa [150] repository maintained at CalTech monitors all data flowing over
all relevant backbone segments. Traffic graphs show most data movement is Tier 2 to Tier 2
now. HEP currently maintains ~30 PB of data (20 PB of which are on-line accessible); in
comparison, Netflix has only 12 TB of data and comprises 30% of all traffic on the commercial
Internet. HEP’s on-line data is used at an average of 0.8 Tbps. The community’s expectations
are that a 5-minute delay is acceptable; the bandwidth required is 2 Gbps average and 10 Gbps maximum.
5.5.2. The Daniel K. Inouye Solar Telescope (DKIST): 15 PB/year
The Daniel K. Inouye Solar Telescope (DKIST, formerly the Advanced Technology Solar
Telescope, ATST) [151] will address questions such as: What is the nature of solar magnetism;
how does that magnetism control our star; and how can we model and predict its changing outputs
that affect the Earth? The facility is located at Haleakala Observatories, Hawaii. Currently under
construction, DKIST represents a collaboration of 22 U.S. institutions, reflecting a broad segment
of the solar physics community.
Depending on final design of the instruments, there will be 5-15 petabytes of data per year. For
their mission, latency and timeliness are not that important. This translates to just under 5 Gbps on
average over the course of a year needed to move the data from Maui to the National Solar
Observatory HQ in Boulder.
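The report's bandwidth figure can be checked with a simple conversion from annual data volume to a sustained rate; the sketch below assumes the upper end of the quoted range (15 PB/year, decimal units) and continuous transfer with no overhead, so it is an order-of-magnitude estimate rather than a network engineering number.

```python
def sustained_gbps(petabytes_per_year):
    """Average rate needed to move a yearly data volume continuously."""
    bits = petabytes_per_year * 1e15 * 8
    seconds = 365 * 24 * 3600
    return bits / seconds / 1e9

# ~15 PB/year works out to roughly 3.8 Gbps sustained, consistent with the
# "just under 5 Gbps average" figure quoted above once protocol overhead
# and less-than-continuous transfer windows are allowed for.
print(f"{sustained_gbps(15):.1f} Gbps")
```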
5.5.3. Dark Energy Survey (DES): 1 GB images, 400 images per night, instrument steering
The Dark Energy Survey (DES) [152] is designed to probe the origin of the accelerating universe
and help uncover the nature of dark energy by measuring the 14-billion-year history of cosmic
expansion with high precision. This collaboration is building an extremely sensitive 570-megapixel
digital camera (DECam) and will mount it on the Blanco 4-meter telescope at Cerro Tololo
Inter-American Observatory high in the Chilean Andes.
Starting in Sept. 2012 and continuing for five years, DES will survey a large swath of the southern
sky out to vast distances in order to provide new clues to this most fundamental of questions.
More than 120 scientists from 23 institutions in the United States, Spain, the United Kingdom,
Brazil, and Germany are working on the project. DES-Brazil is a member of the international
Dark Energy Survey (DES) cooperation led by Fermilab, NCSA and NOAO.
Each DECam image is a gigabyte in size. The Dark Energy Survey will take about 400 of these
extremely large images per night. This presents a very high data-collection rate for an astronomy
experiment. The data are sent via a microwave link to La Serena. From there, an optical link
forwards them to the National Center for Supercomputer Applications (NCSA) in Illinois for
storage and "reduction". The project to date requires small file transfer and occasional large file
transfers at speeds between 100Mpbs to 1 Gbps. Reduction consists of standard image corrections
of the raw CCD information to remove instrumental signatures and artifacts and the joining of
these images into 0.5 square degree "combined images". Then galaxies and stars in the images are
identified, catalogued, and finally their properties measured and stored in a database.
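The DES numbers above imply a modest but steady nightly transfer requirement, which can be estimated directly from the figures quoted (1 GB per image, about 400 images per night); the 12-hour transfer window below is an illustrative assumption, not a project specification.

```python
images_per_night = 400
gigabytes_per_image = 1.0
nightly_volume_gb = images_per_night * gigabytes_per_image   # ~400 GB/night

# If the night's images are shipped to NCSA within an assumed 12-hour window,
# the required average rate is a few tens of Mbps, consistent with the
# small-transfer-plus-occasional-large-transfer pattern at 100 Mbps - 1 Gbps
# described above.
transfer_window_s = 12 * 3600
avg_mbps = nightly_volume_gb * 8e3 / transfer_window_s
print(f"~{nightly_volume_gb:.0f} GB/night, ~{avg_mbps:.0f} Mbps average")
```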
5.5.4. Square Kilometer Array (SKA) (Australia & South Africa): 100 PB per day
The Square Kilometer Array (SKA) will be the world’s largest and most sensitive radio
telescope, addressing fundamental unanswered questions about our Universe including the first
stars and galaxies formed after the Big Bang, how galaxies have evolved since then, the role of
magnetism in the cosmos, the nature of gravity, and the search for life beyond Earth. The SKA is
being built in Australia and South Africa. The total collecting area of the SKA will be one square
kilometer, or 1,000,000 square meters. This will make the SKA the largest radio telescope array
ever constructed, by some margin. To achieve this, the SKA will use several thousand dish (high
frequency) and many more low frequency and mid-frequency aperture array telescopes, with the
several thousand dishes each being 15 metres in diameter. Rather than being clustered only in the central core regions, the telescopes will be arranged in multiple spiral arm configurations, with the dishes extending to vast distances from the central cores, creating what is known as a long baseline interferometer array. The construction phase runs from 2016 to 2024, with prototype systems beginning operation in 2016. The instrument is expected to produce 100 petabytes of data per day.
Collaborators are from neighboring African countries, Australia, New Zealand, Canada, China, Germany, Italy, the Netherlands, Sweden, and India.
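The stated output of 100 petabytes per day implies an enormous sustained rate if it all had to move over wide-area networks; the sketch below (illustrative only, decimal units, no overhead) converts the figure to bits per second. In practice the raw output would presumably be reduced or processed near the instrument before any wide-area transfer, so this is a notional upper bound rather than a stated network requirement.

    # Sustained rate implied by 100 PB/day of raw SKA output (decimal units, no overhead).
    PB_PER_DAY = 100
    bits_per_day = PB_PER_DAY * 1e15 * 8
    tbps = bits_per_day / 86400 / 1e12
    print(f"{PB_PER_DAY} PB/day -> {tbps:.1f} Tbps sustained")   # ~9.3 Tbps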
5.6. FINDINGS: SURVEY OF NSF-FUNDED PIS
An on-line survey of NSF funded Principal Investigators whose projects’ initial funding
commenced in years 2009-2014 was conducted in August 2014. A total of 4,050 PIs
responded, representing a response rate of 13%, and over half of those who responded participate
in one or more international science collaborations (Figure 20 A). The distribution of respondents across NSF program areas is shown in Figure 20 B.
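As a quick, illustrative aside (not a figure from the report), the response count and response rate together imply the approximate size of the surveyed PI population:

    # Approximate number of PIs invited, implied by the reported response count and rate.
    responses = 4050
    response_rate = 0.13                    # treated here as exact for illustration
    print(f"~{responses / response_rate:,.0f} PIs invited")   # ~31,154 PIs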
[Figure 20, panel A: pie chart titled "Over half of respondents participate in international science collaborations"; panel B: bar chart of respondent counts (0-1,200) by NSF program area: Math and Physical Sciences, Biological Sciences, Social, Behavioral and Economic Sciences, Computer & Information Science, GeoScience, Engineering, and Education & Human Resource Development]
FIGURE 20. A MAJORITY OF NSF-FUNDED INVESTIGATORS PARTICIPATE IN ONE OR MORE INTERNATIONAL COLLABORATIONS
The complete survey questions can be found in Appendix B, and survey results can be found in Appendix C. The top ten countries where shared data are created, processed, or stored (multiple countries could be selected by each respondent) are shown in Figure 21. Of the 196 countries listed in the survey, 179 were named by respondents.
FIGURE 21. TOP TEN COUNTRIES SOURCING DATA
Only 12% of respondents indicated that they had difficulty collaborating due to network limitations; for those who had this problem, the top countries mentioned are indicated in Figure 22.
FIGURE 22. COUNTRIES WHERE US INVESTIGATORS EXPERIENCE NETWORK INFRASTRUCTURE ISSUES
6. REFERENCES
[1] "Large Synoptic Survey Telescope (LSST)," 2015. http://www.lsst.org/lsst/
[2] "Atacama Large Millimeter/submillimeter Array (ALMA)," 2015. http://www.almaobservatory.org/
[3] "Square Kilometer Array (SKA Telescope)," 2015. https://www.skatelescope.org/
[4] "James Webb Space Telescope (JWST)," 2015. http://www.jwst.nasa.gov/
[5] "Large Hadron Collider (LHC)," 2015. http://home.web.cern.ch/topics/large-hadron-collider
[6] "International Thermonuclear Experimental Reactor (ITER)," 2015. http://www.iter.org/
[7] "Belle Detector Collaboration," 2015. http://belle2.kek.jp/detector.html
[8] National Oceanic and Atmospheric Administration, "What is LIDAR?," 2015. http://oceanservice.noaa.gov/facts/lidar.html
[9] "GLORIAD Insight," 2015. https://insight.gloriad.org/insight/
[10] "Domain Name System," 2015. http://en.wikipedia.org/wiki/Domain_Name_System
[11] P. Mockapetris, "Domain Names - Concepts and Facilities (IETF RFC 1034)," 1987. https://www.ietf.org/rfc/rfc1034.txt
[12] "Global Lambda Integrated Facility (GLIF)," 2014. http://www.glif.is/
[13] Y. Rekhter, T. Li, and S. Hares, Eds., "Border Gateway Protocol 4 (BGP-4)," 2006. https://tools.ietf.org/html/rfc4271
[14] "Networking 101: Understanding BGP Routing," 2015. http://www.enterprisenetworkingplanet.com/netsp/article.php/3615896/Networking-101Understanding-BGP-Routing.htm
[15] "Software Defined Networking (SDN)," 2013. http://en.wikipedia.org/wiki/Softwaredefined_networking
[16] "Center for Applied Internet Data Analysis (CAIDA)," 2014. http://www.caida.org/home/
[17] "ACE/TransPac3," 2013. http://internationalnetworking.iu.edu/ACE
[18] "Delivery of Advanced Network Technology for Europe (DANTE)," 2015. http://www.dante.net/Pages/default.aspx
[19] "Trans-European Research and Education Networking Association (TERENA)," 2015. https://www.terena.org/
[20] "New York State Education and Research Network (NYSERNet)," 2015. https://www.nysernet.org/
[21] "Internet2," 2014. http://www.internet2.edu/
[22] "AMLight," 2013. http://www.amlight.net/
[23] "Latin American Cooperation of Advanced Networks (RedCLARA)," 2015. http://www.redclara.net/index.php/en/
[24] "U.S. Department of Energy, Energy Sciences Network (ESnet)," 2015. http://www.es.net/
[25] "Canada's Advanced Research and Innovation Network (CANARIE)," 2015. http://www.canarie.ca/
[26] "GLORIAD," 2013. http://www.gloriad.org/gloriaddrupal/
[27] "National Research Centre "Kurchatov Institute"," 2015. http://www.nrcki.ru/e/engl.html
[28] "Korea Institute of Science and Technology Information (KISTI)," 2015. http://en.kisti.re.kr/
[29] "Chinese Academy of Sciences," 2015. http://english.cas.cn/
[30] "SURFnet," 2015. https://www.surf.nl/en/about-surf/subsidiaries/surfnet
[31] "Nordic Infrastructure for Research and Education (NORDUnet)," 2015. https://www.nordu.net/
[32] "IceLink TransAtlantic Polar Network," 2015. https://ftp.nordu.net/ndnweb/nordunet___canarie_announces_icelink.html
[33] "Hong Kong Open Exchange Portal (HKOEP)," 2015. http://www.nren.nasa.gov/workshops/pdfs9/PanelD_OpticalTestbedsinChina-Haina.pdf
[34] "Egyptian National Scientific and Technical Information Network (ENSTINET)," 2015. https://www.b2match.eu/scienceforsociety/participants/10
[35] "Telecomm Egypt (TEGROUP)," 2015. http://www.itu.int/net4/ITUD/CDS/SectorMembersPortal/index.asp?Name=42883
[36] "Tata Communications," 2015. http://www.tatacommunications.com/
[37] "National Knowledge Network (NKN)," 2015. http://www.nkn.in/
[38] "Singapore Advanced Research and Education Network (SingAREN)," 2015. http://www.singaren.net.sg/
[39] "Vietnam Research and Education Network (VinAREN)," 2015. http://en.vinaren.vn/
[40] "TransLight / Pacific Wave," 2013. http://www.hawaii.edu/tlpw/
[41] "TransPac," 2014. http://internationalnetworking.iu.edu/initiatives/transpac3/index.html
[42] "Asia Pacific Advanced Network (APAN)," 2015. http://www.apan.net/
[43] "TransLight/StarLight," 2014. http://www.startap.net/translight/pages/about-home.html
[44] "TCP/IP Protocol Suite," 2015. http://en.wikipedia.org/wiki/Internet_protocol_suite
[45] Eli Dart, "The Science DMZ," 2015. http://www.internet2.edu/presentations/ccnie201404/20140501-dart-sciencedmz-philosophyv3.pdf
[46] "Network Startup Resource Center," 2014. http://www.nsrc.org/
[47] "Network Service Interface (NSI)," 2015. https://redmine.ogf.org/projects/nsi-wg
[48] "Open Grid Forum (OGF)," 2015. http://www.gridforum.org/
[49] "Global Environment for Network Innovations (GENI)," 2015. http://www.geni.net/
[50] "Advanced Layer 2 Services (ALS2)," 2015. http://www.internet2.edu/productsservices/advanced-networking/layer-2-services/layer-2-services-details/
[51] "Global Telecommunications System (GTS)," 2015. http://www.wmo.int/pages/prog/www/TEM/GTS/index_en.html
[52] "HiSeasNet," 2015. http://hiseasnet.ucsd.edu/
[53] "NASA Space Network," 2015. http://www.nasa.gov/directorates/heo/scan/services/networks/txt_sn.html
[54] "Hubble Space Telescope," 2015. http://hubblesite.org/
[55] "NASA's Earth Observing System," 2015. http://eospso.nasa.gov/
[56] "NASA International Space Station," 2015. http://eospso.nasa.gov/
[57] "Internet2 International Big Science List," 2015. https://www.internet2.edu/media/medialibrary/2014/01/22/The_International_Big_Science_List.pdf
[58] "UDP-Based Data Transfer (UDT)," 2015. http://udt.sourceforge.net/
[59] Nick McKeown, Tom Anderson, Hari Balakrishnan, Guru Parulkar, Larry Peterson, Jennifer Rexford, Scott Shenker, and Jonathan Turner, "OpenFlow: enabling innovation in campus networks," SIGCOMM Comput. Commun. Rev., vol. 38, no. 2, pp. 69-74, 2008.
[60] "OpenFlow," 2015. http://archive.openflow.org/wp/learnmore/
[61] "Square Kilometer Array (SKA)," 2015. https://www.skatelescope.org/
[62] "perfSONAR: PERFormance Service Oriented Network monitoring ARchitecture," 2013. http://www.perfsonar.net/
[63] "CAVEwave," 2015. https://www.evl.uic.edu/entry.php?id=580
[64] "CineGrid," 2015. http://www.cinegrid.org/
[65] "GLORIAD "Insight"," 2015. https://insight.gloriad.org/insight/ (live and recent data); https://insight.gloriad.org/history/ (historical data)
[66] "China Science and Technology Network (CSTnet)," 2015. http://indico.cern.ch/event/199025/session/2/contribution/42/material/slides/0.pdf
[67] "File Transfer Protocol (FTP)," 2015. http://en.wikipedia.org/wiki/File_Transfer_Protocol
[68] J. Postel and J. Reynolds, "File Transfer Protocol (FTP)," 1985. https://www.ietf.org/rfc/rfc959.txt
[69] "Aspera," 2015. http://asperasoft.com/
[70] "Internet Protocol, Version 6 (IPv6) Specification," 2015. https://www.ietf.org/rfc/rfc2460.txt
[71] "IPv6," 2015. http://en.wikipedia.org/wiki/IPv6
[72] "IPv4," 2015. http://en.wikipedia.org/wiki/IPv4
[73] Information Sciences Institute, "Internet Protocol (IPv4)," J. Postel, Ed., 1981. https://tools.ietf.org/html/rfc1349
[74] "Experimental Program to Stimulate Competitive Research (EPSCoR)," 2015. http://www.nsf.gov/od/iia/programs/epscor/index.jsp
[75] Von Welch, Doug Pearson, Brian Tierney, and James Williams, "Security at the Cyberborder Workshop Report: Exploring the relationship of International Research Network Connections and CyberSecurity," Mar. 2012.
[76] KC Claffy and Josh Polterock, "International Research Network Connections: Usage and Value Measurement," May 2013.
[77] "XSEDE," 2013. https://www.xsede.org/
[78] "XDMoD: Comprehensive HPC System Management Tool," 2015.
[79] "Advanced Scientific Computing Research Network Requirements," Office of Advanced Scientific Computing Research, DOE Office of Science, Energy Sciences Network, LBNL report LBNL-6109E, Oct. 2012.
[80] William Johnston and Eli Dart, "ESnet capacity requirements: Projections based on historical traffic growth and on documented science requirements for future capacity," 2015.
[81] "High Energy Physics and Nuclear Physics Network Requirements - Final Report," LBNL-6642E, Aug. 2013.
[82] "BES (Basic Energy Sciences) Network Requirements Workshop, September 2010 - Final Report," 2010.
[83] "BER (Biological and Environmental Research) Network Requirements Workshop, April 2010 - Final Report," Apr. 2010.
[84] "Belle-II Experiment Network Requirements Workshop," 2012.
[85] E. Dart, L. Rotman, B. Tierney, M. Hester, and J. Zurawski, "The Science DMZ: A network design pattern for data-intensive science," 2013.
[86] "Cisco Visual Networking Index: Forecast and Methodology, 2013-2018," 2014. http://www.cisco.com/c/en/us/solutions/collateral/service-provider/ip-ngn-ip-next-generationnetwork/white_paper_c11-481360.pdf
[87] A. Odlyzko, "Minnesota Internet Traffic Studies (MINTS)," 2015.
[88] A. Colby, "AT&T, NEC, Corning complete record-breaking fiber capacity test," 2009. http://news.soft32.com/att-nec-corning-complete-record-breaking-fiber-capacity-test_7372.html
[89] "Annual Cisco Visual Networking Index Forecast Projects Global IP Traffic to Increase More Than Fourfold by 2014," 2010. http://newsroom.cisco.com/dlls/2010/prod_060210.html
[90] Craig Labovitz, Scott Iekel-Johnson, Danny McPherson, Jon Oberheide, and Farnam Jahanian, "Internet inter-domain traffic," SIGCOMM Comput. Commun. Rev., vol. 40, no. 4, pp. 75-86, 2010.
[91] "A Primer on the FITS Data Format," 2015. http://fits.gsfc.nasa.gov/fits_primer.html
[92] "HDF5," 2015. http://www.hdfgroup.org/HDF5/
[93] "National Radio Astronomy Observatory (NRAO)," 2015. http://www.nrao.edu/
[94] "Structured Query Language (SQL)," 2013. http://en.wikipedia.org/wiki/SQL
[95] "NoSQL," 2015. http://en.wikipedia.org/wiki/NoSQL
[96] "Millennium Simulation Database," 2015. http://www.mpa-garching.mpg.de/galform/millennium-II
[97] "Bolshoi Simulation MultiDark Database," 2015. http://www.multidark.org/MultiDark
[98] "Dark Sky Simulations: Early Data Release," 2015. http://arxiv.org/abs/1407.2600
[99] "National Center for Supercomputing Applications (NCSA)," 2013. http://www.ncsa.illinois.edu/
[100] "High Performance Storage System (HPSS)," 2015. http://en.wikipedia.org/wiki/High_Performance_Storage_System
[101] "Redundant Array of Inexpensive Disks (RAID)," 2015. http://en.wikipedia.org/wiki/RAID
[102] "Blue Waters," 2015. http://www.ncsa.illinois.edu/enabling/bluewaters
[103] "Lustre File System," 2015. http://www.cse.buffalo.edu/faculty/tkosar/cse710/papers/lustrewhitepaper.pdf
[104] "iPlant Collaborative," 2015. http://www.iplantcollaborative.org/
[105] "EarthCube," 2015. http://earthcube.org/
[106] "EarthCube Data Discovery Access and Mining (DDMA)," 2015. https://sites.google.com/site/earthcubeddma/
[107] "NASA Earth Science Data System Working Groups (ESDSWG)," 2015. https://earthdata.nasa.gov/esdswg
[108] "International Council for Science: Committee on Data for Science and Technology (CODATA)," 2015. http://www.codata.org/
[109] "American Geophysical Union (AGU)," 2015. http://sites.agu.org/
[110] "American Geophysical Union Earth and Space Science Informatics Focus Group (AGU ESSI)," 2015.
[111] "Open Geospatial Consortium Standards (OGC)," 2015. http://www.opengeospatial.org/standards
[112] "ESIP Federated Open Search," 2015. http://wiki.esipfed.org/index.php/HowTo_Guide_for_Implementing_ESIP_Federated_Search_Servers
[113] "Open-source Project for a Network Data Access Protocol (OPeNDAP)," 2015. http://www.opendap.org/
[114] "Federal Geographic Data Committee (FGDC) ISO 19115," 2015. http://www.fgdc.gov/metadata/geospatial-metadata-standards
[115] "Network Common Data Format (NetCDF)," 2015. http://www.unidata.ucar.edu/software/netcdf/
[116] "Federation of Earth Science Information Partners (ESIP)," 2015. http://www.esipfed.org/
[117] "International Council for Science Integrated Research on Disaster Risk (IRDR)," 2015. http://www.icsu.org/what-we-do/interdisciplinary-bodies/irdr
[118] "Global Lake Ecological Observatory Network (GLEON)," 2015. http://www.gleon.org/
[119] "Research Data Alliance (RDA)," 2015. https://rd-alliance.org/
[120] Fran Berman, Ross Wilkinson, and John Wood, "Building Global Infrastructure for Data Sharing and Exchange Through the Research Data Alliance," D-Lib Magazine, vol. 20, no. 1/2, 2014. http://mirror.dlib.org/dlib/january14/01guest_editorial.print.html
[121] "CoopEUS," 2015. http://www.coopeus.eu/
[122] "Group On Earth Observations Global Earth Observation System of Systems (GEO/GEOSS)," 2015. http://www.earthobservations.org/geoss.php
[123] M. L. Butterfield, J. S. Pearlman, and S. C. Vickroy, "A System-of-Systems Engineering GEOSS: Architectural Approach," Systems Journal, IEEE, vol. 2, no. 3, pp. 321-332, 2008.
[124] Siri Jodha Singh Khalsa, Stefano Nativi, and Gary N. Geller, "The GEOSS Interoperability Process Pilot Project (IP3)," IEEE Transactions on Geoscience and Remote Sensing, vol. 47, no. 1, 2009.
[125] "International Council for Science (ICSU) World Data System," 2015. https://www.icsu-wds.org/
[126] "Data Enabled Life Sciences Alliance (DELSA)," 2015. http://www.delsaglobal.org/
[127] "Facebook," 2015. http://www.facebook.com/
[128] "Instagram," 2015. http://instagram.com/
[129] "WhatsApp," 2015. https://www.whatsapp.com/
[130] "Snapchat," 2015. https://www.snapchat.com/
[131] "Flickr," 2015. https://www.flickr.com/
[132] "YouTube," 2015. https://www.youtube.com/
[133] "Netflix," 2015. https://www.netflix.com/us/
[134] "Moore's Law," 2015. http://en.wikipedia.org/wiki/Moore%27s_law
[135] "National Institute of Standards and Technology (NIST)," 2015. http://www.nist.gov/
[136] "NIST Big Data Program Working Group (NBD-PWG)," 2015. http://bigdatawg.nist.gov/usecases.php
[137] NIST Big Data Public Working Group, Use Cases and Requirements Subgroup, "DRAFT NIST Big Data Interoperability Framework: Volume 3, Use Cases and Requirements," 2014. http://bigdatawg.nist.gov/_uploadfiles/BD_Vol3-UseCaseGenReqs_V1Draft_Pre-release.pdf
[138] "Twitter," 2015. https://twitter.com/?lang=en
[139] "Gnip," 2015. https://gnip.com/
[140] "Google Glass," 2015. https://www.google.com/glass/start/
[141] "Samsung Galaxy S5," 2015. http://shop.sprint.com/mysprint/shop/phone_details.jsp?ensembleId=SPHG900BKS&flow=AAL&isDeeplinked=true&NoModal=true&defaultContractTerm=subsidy&ECID=SEM:Google:P:2014_Q4_OEM:OEM_Samsung_Galaxy_S5_NonBrand:Galaxy_S5_Core_Exact:samsunggalaxys5:Exact&gclid=CIvmyrWykcMCFSgQ7Aod100AnQ&gclsrc=aw.ds
[142] "Apache Storm," 2015. https://storm.apache.org/
[143] Tyler Akidau, Alex Balikov, Kaya Bekiroglu, Slava Chernyak, Josh Haberman, Reuven Lax, Sam McVeety, Daniel Mills, Paul Nordstrom, and Sam Whittle, "MillWheel: Fault-Tolerant Stream Processing at Internet Scale," in Very Large Data Bases, 2013, pp. 734-746.
[144] "Amazon Kinesis," 2015.
[145] "A/B Testing," 2015. http://en.wikipedia.org/wiki/A/B_testing
[146] "International Conference on Computing in High Energy and Nuclear Physics (CHEP)," 2015. http://chep2015.kek.jp/about.html
[147] "ATLAS Experiment," 2015. http://atlas.ch/
[148] "Compact Muon Solenoid (CMS)," 2015. http://home.web.cern.ch/about/experiments/cms
[149] "A Large Ion Collider Experiment (ALICE)," 2015. http://aliceinfo.cern.ch/
[150] "MONitoring Agents using a Large Integrated Services Architecture (MonALISA)," 2015. http://monalisa.caltech.edu/monalisa.htm
[151] "Daniel K. Inouye Solar Telescope (DKIST)," 2015. http://dkist.nso.edu/
[152] "Dark Energy Survey (DES)," 2015. http://www.darkenergysurvey.org/
7. APPENDICES
7.1. APPENDIX A: INTERVIEW WITH NETWORK AND EXCHANGE POINT
OPERATORS
Discussion Topics for IRNC PIs, Network Engineers, Administrators, Operators
Infrastructure:
1. Which network(s) do you support?
2. Do you have influence on the design/architecture of the network(s) you support?
3. Which networks do you peer with?
4. What are your preferred paths/networks?
5. What is the current network capacity?
6. Do you have plans to increase capacity?
7. What is your network redundancy plan?
8. What do you use to monitor your network?
9. Are there additional ways you would like to monitor your network?

Science Community Interaction:
10. What do you view as the current top three data drivers across your network?
11. What would you expect the top three data drivers to be in 2020?
12. What is the current average data flow across your network? What is the highest you have seen it spike to? How often and for how long do spikes occur?
13. Do large data sets typically originate on site and go out, or originate off site and come in? Or both? Do you expect this pattern to continue, reverse, or even out?
14. How do remote users access large data sets on your network? (What are the steps?)
15. How often do you interact with researchers/scientists regarding their network needs?
16. When interacting with researchers/scientists regarding their network needs, what is the most used mode of communication? (Face to face, email, phone, instant messenger, videoconference, etc.?)
17. Are the power users of your network from your organization/agency, or others?

Staff Support:
18. How many people support the network currently? Do you feel the current staffing levels are sufficient?
19. What is the most common troubleshooting issue you/your team deal with when large data sets are traversing the network, including across the WAN?
20. Do you interact with network engineers from other agencies/organizations about the network?
21. What critical skill sets will future network engineers need to possess?

Future Expectations:
22. For exchange point operators: What do you think about the effectiveness of the international exchange point program?
23. Do you envision any network technologies resurfacing and getting widely deployed? If so, please list them and explain why.
24. Do you envision data needs driving international networks to be more or less concentrated on dedicated end-to-end circuits?
25. What criteria do you feel should be used to project the future needs of international networks?
26. How do you think this will affect staffing needs in 2020?
27. What changes would you like to see across international networks in the next few years? Please list any areas that apply: technical, administrative, governance, etc.
28. Who needs to make these changes?
7.2. APPENDIX B: ON-LINE SURVEY FOR NSF-FUNDED PIS
International Collaborations in Science - Data Survey
Introduction
This survey is part of a National Science Foundation funded study (Award Number: 1223688) examining
growth in the amount and types of data moving across the NSF-funded international Research &
Education network infrastructure.
The purpose of this survey is to gather information regarding where data is being produced and stored
for international science collaborations. The time period of interest is from the present through 2020.
Your response will help inform capacity planning for the National Science Foundation’s International
Research Network Connections (IRNC) program. There are only 5 brief questions to answer.
This material does not necessarily reflect the views of the National Science Foundation.
*Question 1: Do you participate in a project with international collaborators or using a science
instrument located in another country?
( ) Yes
( ) No
International Collaboration Project Information
Provide information about the collaboration or instrument, below. If you participate in more than one international collaboration, please describe the project having the largest amount of data. (Please include whatever information you have, and leave the rest blank.)
Project Name: ____________
Project URL: ____________
Approximate number of collaborators: ____________
*For the collaboration project/instrument described above, please check ALL countries where the
data is created, processed, or stored. (Note - this question is about the locations of the data,
not the location of every collaborator. Multiple choice is allowed.)
[Checkbox list of all 196 countries, Afghanistan through Zimbabwe - one checkbox per country]
New Instruments
*Question 2: Do your research plans involve the use of any new, large-scale scientific
instrumentation that will begin or complete construction by the year 2020?
( ) Yes
( ) No
New Instrument Information
Tell us about the new instrument. (Please include the information you know, and leave the rest
blank)
Name of instrument or project building the instrument: ____________
URL describing the instrument: ____________
*Please check ALL countries where the instrument will be located.
(same as previous list of countries as Q1)
Other Impact Factors
*Question 3: Do you know of any activities or technology innovations that will significantly scale
up the amount of data being collected, created or stored for use across international boundaries
by the year 2020?
(for example: dramatic reduction in cost of producing data, new centralization of previously scattered
data, or other drivers)
( ) Yes
( ) No
If Yes please describe the activity/innovation that will scale up the amount of data being produced
_________________________________________________________________________________________________________
Barriers to Collaboration
*Question 4: Are there currently any countries where Internet access is so limited that you are
having difficulties collaborating with others?
( ) Yes
( ) No
*Please select one or more countries where improved network connectivity would significantly impact
your collaboration plans through 2020.
(same as previous list of countries as Q1)
*Question 5: Please indicate your primary science
knowledge domain
( ) Math and Physical Sciences
( ) Biological Sciences
( ) GeoScience
( ) Computer & Information Science
( ) Social, Behavioral and Economic Science
( ) Education & Human Resource Development
( ) Engineering
7.3. APPENDIX C: SUMMARY OF RESPONSES TO NSF PI SURVEY
An on-line survey of NSF funded Principal Investigators whose projects' initial funding commenced in years 2009-2014 was conducted in July and August 2014. A total of 4,050 PIs responded, representing a response rate of 13%.
QUESTION 1:
Do you participate in a project with international collaborators or using a science
instrument located in another country?
Yes – 56%
No - 44%
Those who answered “No” were immediately taken to the end of the survey.
The remaining questions were answered by the 2227 people who answered “Yes”.
Provide information about the collaboration or instrument, below. If you participate in
more than one international collaboration, please describe the project having the
largest amount of data.
This question was answered by 1208 responders, with each listing only one project. Data
collected included project name, URL, and approximate number of collaborators.
Projects having 100 or more participants are included in the "Catalog of International Big
Data Science Programs" web site at http://irnc.clemson.edu/
For the collaboration project/instrument described above, please check ALL countries
where the data are created, processed, or stored. (Note - this question is about the
locations of the data, not the location of every collaborator. Multiple choice is allowed.)
179 countries were mentioned by 1,667 responders; respondents could select as many locations as they wanted. The table below shows countries where data are created, processed, or stored that were mentioned by 50 or more respondents.
Country              Count     Country              Count     Country              Count
United States          606     Brazil                 127     New Zealand             72
Germany                330     Spain                  125     Mexico                  71
United Kingdom         321     India                  114     Russian Federation      71
France                 270     Netherlands            103     Denmark                 67
China                  192     Switzerland             92     South Africa            62
Canada                 177     Korea South             87     Argentina               58
Japan                  170     Israel                  83     Belgium                 58
Australia              156     Sweden                  77     Finland                 58
Italy                  139     Chile                   72     Poland                  57

QUESTION 2:
Do your research plans involve the use of any new, large-scale scientific
instrumentation that will begin or complete construction by the year 2020?
This question was answered by 3393 respondents. 12% responded with YES, 88%
responded with NO.
When asked for the name/URL of the new instrument, 237 persons responded. These
responses are included in the "Catalog of International Big Data Science Programs" web
site at http://irnc.clemson.edu/
When asked to indicate in which country the new instrument would be located, 381
persons responded, listing 89 countries. Countries mentioned by 10 or more respondents
are listed below:
Answer Options       Response Count
United States                   237
Chile                            38
Germany                          30
France                           29
United Kingdom                   25
Japan                            22
China                            21
Australia                        17
Switzerland                      17
Canada                           16
Italy                            15
India                            12
QUESTION 3:
Do you know of any activities or technology innovations that will significantly scale up
the amount of data being collected, created or stored for use across international
boundaries by the year 2020? (for example: dramatic reduction in cost of producing
data, new centralization of previously scattered data, or other drivers)
This question was answered by 3,286 respondents. Twenty-two percent (723 people) responded YES, and 707 provided supporting text. The responses are summarized by primary scientific discipline as word clouds (Figures 23-28), below.
FIGURE 23. MATH & PHYSICAL SCIENCES
FIGURE 24. BIOLOGICAL SCIENCES
FIGURE 25. GEOSCIENCES
FIGURE 26. COMPUTER SCIENCES
FIGURE 27. SOCIAL AND BEHAVIORAL SCIENCES
FIGURE 28. EDUCATION, OUTREACH, AND HUMAN DEVELOPMENT
QUESTION 4:
Are there currently any countries where Internet access is so limited that you are
having difficulties collaborating with others?
A total of 3,272 persons responded, with 12% saying YES. Countries mentioned by 10 or
more persons are listed below.
Answer Options       Response Count     Answer Options            Response Count
China                            62     Mexico                                14
United States                    29     South Africa                          14
India                            26     Uganda                                14
Ghana                            24     Congo {Democratic Rep}                13
Kenya                            19     Iran                                  12
Ethiopia                         18     Russian Federation                    12
Tanzania                         18     Zimbabwe                              12
Nigeria                          17     Cameroon                              10
Brazil                           14
QUESTION 5:
Respondent’s Primary Science Domain (for people who answered yes to international
collaboration)
This question was answered by 3,253 persons, with distribution by science domain shown
below.
[Bar chart: response count by primary science domain - Math and Physical Sciences, Biological Sciences, Social, Behavioral and Economic Science, Computer & Information Science, GeoScience, Engineering, and Education & Human Resource Development; horizontal axis 0-1,200 respondents]
7.4. APPENDIX D: LIST OF PERSONS INTERVIEWED
 Celeste Anderson, Director, External Networking Group, University of Southern California, August 22, 2013.
 Jaqueline Brown, Associate Vice President for Information Technology Partnerships at the University of Washington and Executive Director for International Partnerships for the Pacific Northwest Gigapop/Pacific Wave, August 2013.
 Maxine Brown, Associate Director of the Electronic Visualization Laboratory (EVL) at the University of Illinois at Chicago (UIC), August 13, 2013.
 K-C Claffy, principal investigator for the distributed Cooperative Association for Internet Data Analysis (CAIDA) and resident research scientist based at the University of California's San Diego Supercomputer Center.
 Andy Connolly, Professor of Astronomy, University of Washington (no date).
 Julio Ibarra, Florida International University and AmLight Network, Jan. 22, 2014.
 David Lassner, President, University of Hawaii.
 Tom Lehman, Director of Research, MidAtlantic Crossroads (MAX), August 14, 2013.
 Joe Mambretti, Director of the International Center for Advanced Internet Research at Northwestern University, August 22, 2013.
 Larry Mays, Chair, Department of BioInformatics, University of North Carolina-Charlotte, April 2013.
 Josh Polterock, Manager of Scientific Projects, Center for Applied Internet Data Analysis (CAIDA).
 Tripti Sinha, MidAtlantic Crossroads (MAX), August 14, 2013.
 Dan Sturman, Ph.D., Engineering Director, Google, October 2013.
 John Silvester, Executive Director of the Center for Scholarly Technology and Professor of Computer Engineering, USC, August 2013.
7.5. APPENDIX E: REPORTS USED FOR THIS STUDY
 National Science Foundation (NSF) United States Antarctica Program (USAP) Science Workshop, 27 February 2012. NASA Program Division (NPD) Civil and Commercial Operations (CCO).
 International Big Science List, Jim Williams, Di-Lu, and Ed Moynihan.
 Cisco Visual Networking Index: 2013-2018.
 NIST Big Data Interoperability Framework: Volume 3, Use Cases and General Requirements. Draft Version 1, April 2014, http://dx.doi.org/10.6028/NIST.SP.XXX
 Security at the Cyber Border: Exploring Cybersecurity for International Research Network Connections. Welch, Von; Pearson, Douglas; Tierney, Brian; Williams, James.
 K.C. Claffy and J. Polterock, "International Research Network Connections: Usage and Value Measurement," Tech. rep., Cooperative Association for Internet Data Analysis (CAIDA), May 2013.
7.6. APPENDIX F: SCIENTIFIC AND NETWORK COMMUNITY MEETINGS
ATTENDED FOR REPORT INPUT
TERENA 2012 (Reykjavik, Iceland): Attended conference and met with several network
engineers, exchange point operators, international R&E network operators, network researchers.
CHEP 2012 (New York, New York): Attended conference and met with HEP (high energy
physics) researchers. Organized a short BOF to introduce researchers to this project and request
information for this report.
Internet2 Meetings (Joint Techs, TIP 2013, Member Meeting – various US locations): Attended
meetings to learn more about projects involving international collaborations and large datasets.
Interviewed IRNC PIs and network operators.
CANS 2012 (Seattle, WA): Met researchers, network operators, and exchange point operators working on collaborations between the US and China.
GLIF 2012 (Chicago, IL): Met with IRNC PIs in attendance, other operators and researchers
working on data-intensive international collaborations. Noted upcoming network upgrades to
current international connections and the applications driving those upgrades. Obtained introductions to additional points of contact for related projects.
Various site visits/interviews: Indiana University (ACE/TP3), UCSD (CAIDA.org, various
researchers in HEP, Oceanography), SDSC (network operators, HPC specialists/scientists),
NCAR (network operators, Climatology researchers), ESnet (Network engineers dealing with
tuning and data-intensive networking projects), Starlight operators, CENIC operators, PNWGP
operators.