0737091 COVER SHEET FOR PROPOSAL TO THE NATIONAL SCIENCE FOUNDATION NSF 07-543 05/09/07
Corrected : 05/09/2007 COVER SHEET FOR PROPOSAL TO THE NATIONAL SCIENCE FOUNDATION PROGRAM ANNOUNCEMENT/SOLICITATION NO./CLOSING DATE/if not in response to a program announcement/solicitation enter NSF 04-23 NSF 07-543 FOR NSF USE ONLY NSF PROPOSAL NUMBER 05/09/07 FOR CONSIDERATION BY NSF ORGANIZATION UNIT(S) 0737091 (Indicate the most specific unit known, i.e. program, division, etc.) DUE - CCLI-Phase 1: Exploratory DATE RECEIVED NUMBER OF COPIES DIVISION ASSIGNED FUND CODE DUNS# 05/08/2007 2 11040000 DUE EMPLOYER IDENTIFICATION NUMBER (EIN) OR TAXPAYER IDENTIFICATION NUMBER (TIN) 7494 077817450 SHOW PREVIOUS AWARD NO. IF THIS IS A RENEWAL AN ACCOMPLISHMENT-BASED RENEWAL FILE LOCATION (Data Universal Numbering System) 05/13/2007 11:29am S IS THIS PROPOSAL BEING SUBMITTED TO ANOTHER FEDERAL AGENCY? YES NO IF YES, LIST ACRONYM(S) 540836354 NAME OF ORGANIZATION TO WHICH AWARD SHOULD BE MADE ADDRESS OF AWARDEE ORGANIZATION, INCLUDING 9 DIGIT ZIP CODE George Mason University 4400 University Drive, MSN 4C6 Fairfax, VA. 220304443 George Mason University AWARDEE ORGANIZATION CODE (IF KNOWN) 0037499000 NAME OF PERFORMING ORGANIZATION, IF DIFFERENT FROM ABOVE ADDRESS OF PERFORMING ORGANIZATION, IF DIFFERENT, INCLUDING 9 DIGIT ZIP CODE PERFORMING ORGANIZATION CODE (IF KNOWN) IS AWARDEE ORGANIZATION (Check All That Apply) (See GPG II.C For Definitions) TITLE OF PROPOSED PROJECT MINORITY BUSINESS IF THIS IS A PRELIMINARY PROPOSAL WOMAN-OWNED BUSINESS THEN CHECK HERE Curriculum for an Undergraduate Program in Data Sciences - CUPIDS REQUESTED AMOUNT PROPOSED DURATION (1-60 MONTHS) 150,000 $ SMALL BUSINESS FOR-PROFIT ORGANIZATION 24 REQUESTED STARTING DATE 01/01/08 months SHOW RELATED PRELIMINARY PROPOSAL NO. 
IF APPLICABLE CHECK APPROPRIATE BOX(ES) IF THIS PROPOSAL INCLUDES ANY OF THE ITEMS LISTED BELOW BEGINNING INVESTIGATOR (GPG I.A) HUMAN SUBJECTS (GPG II.D.6) DISCLOSURE OF LOBBYING ACTIVITIES (GPG II.C) Exemption Subsection PROPRIETARY & PRIVILEGED INFORMATION (GPG I.B, II.C.1.d) INTERNATIONAL COOPERATIVE ACTIVITIES: COUNTRY/COUNTRIES INVOLVED or IRB App. Date HISTORIC PLACES (GPG II.C.2.j) (GPG II.C.2.j) SMALL GRANT FOR EXPLOR. RESEARCH (SGER) (GPG II.D.1) VERTEBRATE ANIMALS (GPG II.D.5) IACUC App. Date PI/PD DEPARTMENT PI/PD POSTAL ADDRESS MS 5C3 Science & Technology I, Room 109 PI/PD FAX NUMBER Fairfax, VA 220304443 United States 703-993-1993 NAMES (TYPED) HIGH RESOLUTION GRAPHICS/OTHER GRAPHICS WHERE EXACT COLOR REPRESENTATION IS REQUIRED FOR PROPER INTERPRETATION (GPG I.G.1) High Degree Yr of Degree Telephone Number Electronic Mail Address PhD 1989 703-993-3617 jwallin@gmu.edu PhD 1983 703-993-8402 kborne@gmu.edu PhD 1976 703-993-1671 dcarr@galaxy.gmu.edu Ph.D. 1974 703-993-1994 jgentle@gmu.edu PhD 2000 703-993-1361 rweigel@gmu.edu PI/PD NAME John F Wallin CO-PI/PD Kirk D Borne CO-PI/PD Daniel B Carr CO-PI/PD James E Gentle CO-PI/PD Robert S Weigel Page 1 of 2 Electronic Signature CERTIFICATION PAGE Certification for Authorized Organizational Representative or Individual Applicant: By signing and submitting this proposal, the individual applicant or the authorized official of the applicant institution is: (1) certifying that statements made herein are true and complete to the best of his/her knowledge; and (2) agreeing to accept the obligation to comply with NSF award terms and conditions if an award is made as a result of this application. Further, the applicant is hereby providing certifications regarding debarment and suspension, drug-free workplace, and lobbying activities (see below), as set forth in Grant Proposal Guide (GPG), NSF 04-23. 
Willful provision of false information in this application and its supporting documents or in reports required under an ensuing award is a criminal offense (U. S. Code, Title 18, Section 1001). In addition, if the applicant institution employs more than fifty persons, the authorized official of the applicant institution is certifying that the institution has implemented a written and enforced conflict of interest policy that is consistent with the provisions of Grant Policy Manual Section 510; that to the best of his/her knowledge, all financial disclosures required by that conflict of interest policy have been made; and that all identified conflicts of interest will have been satisfactorily managed, reduced or eliminated prior to the institution’s expenditure of any funds under the award, in accordance with the institution’s conflict of interest policy. Conflicts which cannot be satisfactorily managed, reduced or eliminated must be disclosed to NSF. Drug Free Work Place Certification By electronically signing the NSF Proposal Cover Sheet, the Authorized Organizational Representative or Individual Applicant is providing the Drug Free Work Place Certification contained in Appendix C of the Grant Proposal Guide. Debarment and Suspension Certification (If answer "yes", please provide explanation.) Is the organization or its principals presently debarred, suspended, proposed for debarment, declared ineligible, or voluntarily excluded from covered transactions by any Federal department or agency? Yes No By electronically signing the NSF Proposal Cover Sheet, the Authorized Organizational Representative or Individual Applicant is providing the Debarment and Suspension Certification contained in Appendix D of the Grant Proposal Guide. 
Certification Regarding Lobbying This certification is required for an award of a Federal contract, grant, or cooperative agreement exceeding $100,000 and for an award of a Federal loan or a commitment providing for the United States to insure or guarantee a loan exceeding $150,000. Certification for Contracts, Grants, Loans and Cooperative Agreements The undersigned certifies, to the best of his or her knowledge and belief, that: (1) No federal appropriated funds have been paid or will be paid, by or on behalf of the undersigned, to any person for influencing or attempting to influence an officer or employee of any agency, a Member of Congress, an officer or employee of Congress, or an employee of a Member of Congress in connection with the awarding of any federal contract, the making of any Federal grant, the making of any Federal loan, the entering into of any cooperative agreement, and the extension, continuation, renewal, amendment, or modification of any Federal contract, grant, loan, or cooperative agreement. (2) If any funds other than Federal appropriated funds have been paid or will be paid to any person for influencing or attempting to influence an officer or employee of any agency, a Member of Congress, an officer or employee of Congress, or an employee of a Member of Congress in connection with this Federal contract, grant, loan, or cooperative agreement, the undersigned shall complete and submit Standard Form-LLL, ‘‘Disclosure of Lobbying Activities,’’ in accordance with its instructions. (3) The undersigned shall require that the language of this certification be included in the award documents for all subawards at all tiers including subcontracts, subgrants, and contracts under grants, loans, and cooperative agreements and that all subrecipients shall certify and disclose accordingly. This certification is a material representation of fact upon which reliance was placed when this transaction was made or entered into. 
Submission of this certification is a prerequisite for making or entering into this transaction imposed by section 1352, Title 31, U.S. Code. Any person who fails to file the required certification shall be subject to a civil penalty of not less than $10,000 and not more than $100,000 for each such failure.

AUTHORIZED ORGANIZATIONAL REPRESENTATIVE
NAME: Karen G Cohn
SIGNATURE: Electronic Signature
DATE: May 8 2007 7:42PM
TELEPHONE NUMBER: 703-993-4104
ELECTRONIC MAIL ADDRESS: kcohn@gmu.edu
FAX NUMBER: 703-993-2296

*SUBMISSION OF SOCIAL SECURITY NUMBERS IS VOLUNTARY AND WILL NOT AFFECT THE ORGANIZATION’S ELIGIBILITY FOR AN AWARD. HOWEVER, THEY ARE AN INTEGRAL PART OF THE INFORMATION SYSTEM AND ASSIST IN PROCESSING THE PROPOSAL. SSN SOLICITED UNDER NSF ACT OF 1950, AS AMENDED.

NATIONAL SCIENCE FOUNDATION
Division of Undergraduate Education
NSF FORM 1295: PROJECT DATA FORM

The instructions and codes to be used in completing this form are provided in Appendix II.

1. Program-track to which the Proposal is submitted: CCLI-Phase 1: Exploratory
2. Name of Principal Investigator/Project Director (as shown on the Cover Sheet): Wallin, John
3. Name of submitting Institution (as shown on Cover Sheet): George Mason University
4. Other Institutions involved in the project’s operation:

Project Data:
A. Major Discipline Code: 35
B. Academic Focus Level of Project: BO
C. Highest Degree Code: D
D. Category Code: -
E. Business/Industry Participation Code: NA
F. Audience Code:
G. Institution Code: PUBL
H. Strategic Area Code: IT
I. Project Features: C A

Estimated number in each of the following categories to be directly affected by the activities of the project during its operation:
J. Undergraduate Students: 80
K. Pre-college Students: 0
L. College Faculty: 0
M. Pre-college Teachers: 0
N.
Graduate Students: 0

NSF Form 1295 (10/98)

Curriculum for an Undergraduate Program in Data Sciences - CUPIDS

The goal of this project is to increase students' understanding of the role that data plays across the sciences, as well as to increase students' ability to use the technologies associated with data acquisition, mining, analysis, and visualization. Based on this goal, we have created five objectives for this project:

1. To teach students what Data Sciences is and how it is changing the way science is being done across the disciplines
2. To change students' attitudes about using computers for scientific analysis and improve their confidence in applying computers to scientific data problems
3. To increase students' abilities in using visualization to examine scientific questions
4. To increase students' abilities to use databases for scientific inquiry
5. To increase students' abilities to acquire, process, and explore experimental data with the use of a computer

The objectives for this project will be achieved within the new Bachelor of Science degree in Computational and Data Sciences at the Fairfax Campus of George Mason University (Mason). This new undergraduate degree and the associated curriculum were reviewed and approved in May 2007 by both the University and the State Council of Higher Education in Virginia. This proposal seeks funding through the CCLI program to develop the curricular elements in four courses (12 credits) within this degree program and to evaluate their effectiveness for teaching Data Sciences to undergraduate students.

Intellectual Merit

The Data Sciences curriculum at Mason falls in line with the recommendations and goals of several national agency and national academy reports that detail the exponential expansion of data. The urgency and need for such a curriculum cannot be overstated.
The NSF's Atkins Report stated it this way: "The importance of data in science and engineering continues on a path of exponential growth; some even assert that the leading science driver of high-end computing will soon be data rather than processing cycles. Thus it is crucial to provide major new resources for handling and understanding data." The core and most basic resource is the human expert, trained in key data science skills. In a recent Data Science Journal article [Smith, 2006], it is argued that now is the time for Data Sciences curricula. Further, a recent NSF-cosponsored workshop on Data Repositories stated, "Data-driven science is becoming a new scientific paradigm -- ranking with theory, experimentation, and computational science."

Broader Impact

The broader impact of this proposal will be seen in four key ways. First, by creating curricular materials, this project lowers the barriers to adoption of a Data Sciences curriculum by other universities. This area is an emerging academic discipline, and few curricular materials are available for new programs. Second, it explores the pedagogical effectiveness of such a curriculum in terms of conceptual understanding as well as cognitive and affective changes. Third, this project will produce college graduates who are ready to meet national, regional, and local workforce needs in responding to the upcoming flood of data. The creation of this workforce is critical to economic growth and international competitiveness. Finally, this program will create faculty expertise in teaching Data Sciences to undergraduate students that will be shared with other universities through formal and informal presentations and contacts.

TABLE OF CONTENTS

For font size and page formatting specifications, see GPG section II.C.

Section                                                           Total No. of Pages   Page No.* (Optional)*
Cover Sheet for Proposal to the National Science Foundation
Project Summary (not to exceed 1 page)                            1
Table of Contents                                                 1
Project Description (Including Results from Prior NSF Support)    15
  (not to exceed 15 pages) (Exceed only if allowed by a specific program announcement/solicitation or if approved in advance by the appropriate NSF Assistant Director or designee)
References Cited                                                  1
Biographical Sketches (Not to exceed 2 pages each)                10
Budget (Plus up to 3 pages of budget justification)               4
Current and Pending Support                                       5
Facilities, Equipment and Other Resources                         1
Special Information/Supplementary Documentation                   0
Appendix (List below.)
  (Include only if allowed by a specific program announcement/solicitation or if approved in advance by the appropriate NSF Assistant Director or designee)
Appendix Items:

*Proposers may select any numbering mechanism for the proposal. The entire proposal however, must be paginated. Complete both columns only if the proposal is numbered consecutively.

Curriculum for an Undergraduate Program in Data Sciences - CUPIDS

1.0 Project Goal and Objectives

This project responds to the Phase I call for proposals of the Course, Curriculum and Laboratory Improvement (CCLI) program of the National Science Foundation. The project is designed to respond to two of the five elements of the cyclical model of practice in undergraduate STEM education. First, we will create new learning materials and teaching strategies within the field of Data Sciences. Second, we will assess student achievement after applying these materials within our courses to determine how well the pedagogical goals of the proposal have been achieved.

The goal of this project is to increase students' understanding of the role that data plays across the sciences, as well as to increase students' ability to use the technologies associated with data acquisition, mining, analysis, and visualization. We have five objectives for this project:

1. To teach students what Data Sciences is and how it is changing the way science is being done across the disciplines
2. To change students' attitudes about using computers to address scientific data problems and improve their confidence in doing so
3. To increase students' abilities to use visualization for generating and addressing scientific questions
4. To increase students' abilities to use databases for scientific inquiry
5. To increase students' abilities to acquire, process, and explore experimental data with the use of a computer

The outcomes and metrics for evaluating these goals are detailed in section 6.

The proposed activities in this CCLI proposal are intended to enhance and broaden the impact of the new Bachelor of Science degree in Computational and Data Sciences (CDS) at George Mason University (Mason). The CDS department also administers both MS and PhD degrees in Computational Sciences. The new undergraduate degree and the associated curriculum were reviewed and approved in May 2007 by both the University and the State Council of Higher Education in Virginia. The faculty and infrastructure necessary to support the proposed CCLI activities are already in place. We are scheduled to have our first students enrolled in the program in Fall 2007.

In this new degree, students are required to take 23 credits in Mathematics, 15 in Computer Science, 25 in selected scientific domains, and 28 in General Education. Including electives, a total of 11 new courses (33 credits) in Computational and Data Sciences were created for this new degree program. Of the 11 new courses in the degree program, this proposal seeks support through the CCLI program to develop and evaluate the curricular elements of four key courses in Data Sciences.
This degree represents a new direction for integrated science at Mason that is distinct from other existing Computational Science undergraduate degrees around the country because of its emphasis on the emerging field of Data Sciences. Beyond the required courses in Statistics, the new courses we are developing include:

• a one-semester course entitled Introduction to Computational and Data Sciences (freshman level)
• a one-semester course in Scientific and Statistical Visualization (junior level)
• a one-semester course in Scientific Data and Databases (junior level)
• a one-semester course in Scientific Data Mining (senior level)

Two of these courses, Scientific and Statistical Visualization and Scientific Data and Databases, have been taught at the graduate level at Mason for 15 years. Although redesigning these courses for junior- and senior-level students will be challenging, the instructors' experience with this material, combined with their experience teaching undergraduates at all levels, will make these modifications relatively straightforward.

The Scientific Data Mining course presents some additional challenges, in part because there is no direct analog at the graduate level at Mason. Our university does offer a 15-credit graduate certificate in Data Mining that is taught, in part, by faculty on this proposal team. However, the materials and methods in this certificate generally require previous courses in advanced statistics, mathematics, and computing that are difficult to transfer to an undergraduate audience of science majors. The primary challenge in developing this course will be selecting the essential techniques and adapting curricular materials to effectively reach the targeted audience.

The Introduction to Computational and Data Sciences course also presents special challenges in curricular development. First, this course is designed for freshman students with minimal mathematics and computing backgrounds.
Second, this course, like Scientific Data Mining, has no direct analog at the graduate level. Several of us on this project have taught a graduate course in Foundations of Computational Science, but Data Sciences was not included in that course.

Before continuing, we need to define Data Sciences. The field of Data Sciences encompasses elements of traditional statistical methods, computational statistics, visualization, statistical learning, simulation, modeling, data acquisition, data mining, reduction, analysis, and storage. The Data Sciences are firmly grounded in probability theory, logic, and other areas of mathematical analysis. Many of the methods of the Data Sciences are computationally intensive. However, the computations are not just to "process the data" and to compute summary statistics. Rather, computations serve as a tool of discovery by providing alternative views of the data and allowing exploration of various models suggested by these viewpoints.

The above figure illustrates the flow of data from its sources through decision making, illustrating the major elements associated with Data Sciences. The proposed curriculum in Data Sciences includes:

• data acquisition/data sources,
• reduction and storage in data warehouses,
• data exploration and data mining through statistical/machine learning and visualization techniques,
• data presentation.

In short, we wish to teach students how data flows from instruments to decision making in all scientific disciplines.

In section 2 of this proposal, we discuss the impending flood of data that will be driving science over the next ten years. In section 3, we provide background on curricular development projects in computational science, along with initiatives to move data into the classroom within disciplines. Section 4 justifies this initiative in terms of NSF goals and workforce demands. We explain the types of curricular materials that we plan to create in section 5.
In section 6 we present our evaluation objectives and the metrics to be used in assessing the project. Section 7 details our plans for dissemination, section 8 addresses the broader impact of this proposal, and section 9 contains the management plan.

2.0 Background

2.1 Motivation

The growth of data volumes in nearly all scientific disciplines, business sectors, and federal agencies is reaching epidemic proportions. This epidemic is characterized roughly by a doubling of data each year. It has been said that "while data doubles every year, useful information seems to be decreasing" (M. Dunham 2002), and there is a growing gap between the generation of data and our understanding of it (I. Witten & E. Frank 2005). In an information society with an increasingly knowledge-based economy [Drucker, 1999; 2002], it is imperative that the workforce of today and especially tomorrow be equipped to understand data. This understanding includes knowing how to interpret, access, retrieve, use, analyze, mine, and integrate data from disparate sources.

This is emphatically true in the sciences as well. Because scientific instrumentation is becoming more microprocessor-based, the scale of data-capturing capabilities grows at least as fast as the underlying computation-based measurement system (J. Gray et al. 2005). For example, in astronomy, the fast growth in CCD detector size and sensitivity has seen the size of a typical large astronomy sky survey project grow from hundreds of gigabytes 10 years ago (e.g., the MACHO survey), to tens of terabytes today (e.g., 2MASS and Sloan Digital Sky Survey [http://www.sdss.org/], J. Gray & A. Szalay 2004), up to a projected size of tens of petabytes 10 years from now (e.g., LSST = Large Synoptic Survey Telescope [http://www.lsst.org/], J. Becla et al. 2006).
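As a rough consistency check on the survey sizes quoted above, the doubling time they imply can be estimated under an assumption of steady exponential growth. The sketch below is illustrative only: the representative sizes (300 GB, 30 TB, 30 PB) are assumed stand-ins for "hundreds of gigabytes", "tens of terabytes", and "tens of petabytes", not values taken from the cited surveys, and the function name is hypothetical.

```python
import math


def doubling_time_years(size_start, size_end, years):
    """Doubling time implied by steady exponential growth between two sizes."""
    if size_start <= 0 or size_end <= size_start or years <= 0:
        raise ValueError("need 0 < size_start < size_end and years > 0")
    # size_end = size_start * 2 ** (years / T)  =>  T = years * ln 2 / ln(ratio)
    return years * math.log(2) / math.log(size_end / size_start)


# Byte units (decimal prefixes).
GB = 1e9
TB = 1e12
PB = 1e15

# Assumed representative survey sizes (illustrative, not from the cited sources).
past_size = 300 * GB      # "hundreds of gigabytes", 10 years ago
present_size = 30 * TB    # "tens of terabytes", today
future_size = 30 * PB     # "tens of petabytes", 10 years from now

past_to_present = doubling_time_years(past_size, present_size, 10)
present_to_future = doubling_time_years(present_size, future_size, 10)

print(f"Past decade: data doubled about every {past_to_present:.1f} years")
print(f"Next decade: data doubles about every {present_to_future:.1f} years")
```

With these assumed sizes, the hundredfold growth of the past decade corresponds to a doubling time of roughly a year and a half, while the projected thousandfold growth of the next decade corresponds to a doubling time of about one year, in line with the yearly doubling described in the text.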
Consequently, we see the floodgates of data opening wide in astronomy, high-energy physics, bioinformatics, numerical simulation research, geosciences, climate monitoring and modeling, and more. Outside of the sciences, it is widely documented that the data flood is in full force in banking, healthcare, homeland security, drug discovery, medical research, insurance, and (as we all have seen) e-mail. The application of data mining, knowledge discovery, text mining, and e-discovery tools to these growing data repositories is essential to the success of agencies, economies, and scientific disciplines.

2.2 Data Sciences as an Academic Discipline

Within the scientific domain, Data Sciences is becoming a recognized academic discipline. In a recent Data Science Journal article [Smith, 2006], it is argued that now is the time for Data Sciences curricula. In another article (Cleveland 2001), Data Sciences is again promoted as a rigorous academic discipline. Further, a recent (2007) NSF-cosponsored workshop on Data Repositories, which included a track on data-centric scholarship, explicitly stated what we now believe: "Data-driven science is becoming a new scientific paradigm -- ranking with theory, experimentation, and computational science." Consequently, many scientific disciplines are developing sub-disciplines that are information-rich and data-based, to such an extent that these are now becoming (or have already become) recognized stand-alone research disciplines and academic programs on their own merits. These include bioinformatics and geo-informatics, and will soon include astro-informatics, e-Science, medical/health informatics, computational learning and statistics, and data science.

Several national study groups have issued reports on the urgency of establishing scientific and educational programs to face the data flood challenges. These include:

1. National Academy of Sciences report: "Bits of Power: Issues in Global Access to Scientific Data" (1997);
2. NSF report on "Knowledge Lost in Information: Report of the NSF Workshop on Research Directions for Digital Libraries" (2003);
3. NSB (National Science Board) report on "Long-lived Digital Data Collections: Enabling Research and Education in the 21st Century" (2005);
4. NSF "Atkins Report" on "Revolutionizing Science and Engineering Through Cyberinfrastructure: Report of the National Science Foundation Blue-Ribbon Advisory Panel on Cyberinfrastructure" (2005);
5. NSF and ARL (Association of Research Libraries) report on "Long-term Stewardship of Digital Data Sets in Science and Engineering" (2006);
6. NSF report on "Cyberinfrastructure Vision for 21st Century Discovery" (2007).

Each of these reports issues a call to action and a herald's cry to respond to the data avalanche in science, engineering, and the global scholarly environment.

2.3 Mason's Role in the Data Sciences

In order to train a workforce that is prepared to succeed in the data-drenched knowledge-based economy and scientific research domains, we aim to develop a Data Sciences undergraduate program at George Mason University (Mason). Within Mason's Department of Computational and Data Sciences (CDS), the context of the Data Sciences program will be primarily the sciences: astronomy, physics, biology, chemistry, and related sub-disciplines such as computational fluid dynamics, numerical simulation research, bioinformatics, and computational materials science. The tool set that students in the Data Sciences curriculum will encounter (and be trained to use) includes statistics, databases, data mining, visualization, and data structures.

The Data Sciences curriculum at Mason falls in line with the recommendations and goals of the national studies and reports cited above. The urgency and need for such a curriculum cannot be overstated.
The Atkins Report stated it this way: "The importance of data in science and engineering continues on a path of exponential growth; some even assert that the leading science driver of high-end computing will soon be data rather than processing cycles. Thus it is crucial to provide major new resources for handling and understanding data." The core and most basic resource is the human expert, trained in key data science skills. As stated in the 2003 NSF "Knowledge Lost in Information" report, human cognition and human capabilities are fundamental to successful leveraging of cyberinfrastructure, digital libraries, and national data resources.

3.0 Relation to Other Programs

3.1 Curricular Development in Computational Sciences

Over the last ten years, a number of academic institutions have developed courses and curricula appropriate for undergraduates studying computational science. In fact, Computational Science courses have been developed at all academic levels from K-12 through graduate school. By combining and creating advanced courses in topics including simulation and high performance computing with traditional courses in engineering, science, mathematics, numerical methods, and computer science, there is a relatively well established canon of topics and teaching materials available for students at all academic levels. Much of this material is in the form of freely available on-line modules:

• One of the earliest on-line resources for Computational Science was the Computational Sciences Education Program (CSEP) developed under a grant by the Department of Energy (http://www.phy.ornl.gov/csep/). This electronic book provided instructors and students information on tools and methods along with selected case studies in Computational Science.
• Many of the on-line curricular materials for Computational Science have been produced at supercomputing facilities.
The National Parallel Architectures Center at Syracuse University (http://www.npac.syr.edu/Education/) is one example of this type of site, although similar outreach activities have existed at the San Diego Supercomputer Center and the National Center for Supercomputing Applications. The collection of links to Computational Science educational materials at this site is excellent (http://www.npac.syr.edu/projects/cpsedu/CSEmaterials/).
• The Keck Undergraduate Computational Science Educational Consortium (KUCSEC) provides an extensive set of modules from Capital University in Columbus, OH (http://oldsite.capital.edu/acad/as/csac/Keck/index.html). In addition to the modules, this site offers a detailed guide for authors who write educational modules in Computational Science (http://oldsite.capital.edu/acad/as/csac/Keck/guidebook.html). We plan to adopt this template for the Data Sciences modules that we will create at Mason.
• The National Computational Science Institute has developed an extensive set of materials and workshops in this field (http://www.computationalscience.org). This on-going grant supports undergraduate education in Computational Science across multiple scientific disciplines.

In our new Bachelor of Science in Computational and Data Sciences, we plan to take full advantage of these existing curricular materials. One obvious overlap is in Scientific Visualization, where many previous projects have been created to teach undergraduates how to visually manipulate data. We will take advantage of these materials, and also add our expertise in information visualization, statistical visualization, and visual data mining to our program.

By contrast, there is no well-developed set of curricular materials available for the Data Sciences, particularly at the freshman level and in the area of scientific data mining. This gap is the primary focus of curricular development within this proposal.
3.2 Data Science Related Programs

We have found a few schools that offer programs similar to Data Sciences, but none focus on the broad range of physical and biological sciences that we are proposing. For example, the University of Michigan's School of Information has programs (graduate only) in specialties related to data sciences: Archives and Records Management, Community Informatics, Information Analysis and Retrieval. Similarly, as at many other schools, the University of North Carolina School of Information and Library Science has an undergraduate program in Information Science. But all of these differ from our proposed program at Mason: we are uniquely focused on Data Sciences and on data in the sciences, and not on information solely in the Web or Library context.

Outside of the U.S., the University of Dortmund (Germany) has a bachelor's program in Data Analysis and Data Management, plus a master's program in Data Science. They are strongly focused on advanced mathematical and statistical methods, though their program description does indicate that they incorporate different data sets (including scientific) within the coursework.

Closer to our program, we find that Rensselaer Polytechnic Institute has graduate-level data sciences programs within specific science disciplines (bioinformatics and cheminformatics), as does Columbia University (geoinformatics). Of course, there are many universities with geoinformatics and bioinformatics programs (including Mason). And this is our point: we believe that such programs are not sufficient to address the data avalanche that is upon us in all disciplines -- a general sci-informatics curriculum for all of science is needed. This Data Science discipline may be called Discovery Informatics, because it is the discipline of organizing, accessing, mining, analyzing, and visualizing data for scientific discovery.
We propose to blend those informatics concepts with Data Sciences methods (databases, data mining, visualization) and then integrate all of this with scientific data from many disciplines (biology, astronomy, physics, space weather, materials science, chemistry, geosciences) throughout our curriculum. The Mason Department of Computational and Data Sciences was created to go into these uncharted waters, and our proposed curriculum development program is the requisite next major step in that direction.

Despite the general lack of curricular materials in Data Sciences, there have been several investigations and conferences about incorporating data into the classroom. The project on "Using Data in the Classroom" at the National Science Digital Library (NSDL) states, "One of the great promises of the NSDL is the ability to make it easy for students to explore data to answer their own questions." This site provides links to data tools (such as the Data Discovery Toolkit) and sample discipline-based scenarios about how data is used in teaching some courses. The other links available at the NSDL under "Pedagogical Resources" and "Activities and Examples" will certainly be used within our curriculum as well. At the beginning of the "Using Data in the Classroom" web page at the NSDL, they state several questions that we will address in this proposal:
• What are the learning goals for using data in the classroom?
• How do different disciplines use data in the classroom?
• What methodologies are held in common or have wide application?
• What methodologies hold promise for enhancing interdisciplinary learning?
• How do we evaluate the impact of data-based inquiry on learning?
This project will include activities that directly or indirectly address all of these questions. In each of the activities at the NSDL in the section "Using Data in the Classroom", the scenarios and examples are strongly domain based.
Earth sciences students were given problems associated with geothermal gradient data, while students in biology were given projects involving the visualization of organic molecules. The defining themes for the projects were the domains, not the underlying data and analysis techniques. The key focus of our proposal goes beyond just providing data within a domain-specific scientific context. Our project aims to teach students how data and data analysis tools are used across scientific disciplines to form and test scientific hypotheses. Although there are differences between our focus and the "Using Data in the Classroom" project at the NSDL, their materials and lessons learned will be incorporated into our work. One of the lessons that this project discusses (http://serc.carleton.edu/resources/870.html) concerns the basic modes in which students interact with data. In summary, students:
• Generate, collect, and analyze their own data in the context of a larger project,
• Use existing data to either ask new questions or answer questions already posed, or
• Collect data, develop a model, and compare the model to the data.
All three of these modes will be used within our courses, particularly in the Introduction to Computational and Data Sciences course. Each of the modes reflects different aspects of the Data Sciences, from acquisition, through reduction, analysis, and comparison with models, to knowledge discovery. Another important lesson from this previous work is the set of tips for designing successful activities. They present nine specific pointers to help successfully engage students with data. These include:
• Design exercises with student background in mind: an overwhelming or negative early experience with data can be devastating to student confidence.
• Create a safety net to support students through the challenges of research.
• Create opportunities for students to work with data and tools outside of class or lab.
Each of these principles (along with others from this report) will be closely followed in our curricula. For our students to have a positive experience when interacting with these complex tools, designing a safety net with peer support, group projects, on-line discussions, and on-line office hours is critical to our success.

4.0 Justification

4.1 The NSF and Competitiveness

In the NSF 2007 Facility Plan (NSF 07-22), several major research facilities are identified as ongoing, under construction, new starts, ready for funding consideration, or "horizon" projects. In essentially all of these cases, the facility will be a major scientific data producer (e.g., EarthScope, IceCube, NEON, LIGO, ALMA, HIAPER, and more). In order to prepare for the data flood from these major NSF-funded programs, and to reap the maximum scientific return from their investment, it is critical to train young researchers (and pre-research undergraduates) in the ways and means of Data Sciences. The scientific knowledge discovery potential of the databases to be produced by these projects is enormous, as are the challenges in dealing with the corresponding data firehose. NSF has stated in numerous reports and congressional hearings that the agency plans to strengthen science education and address areas of high importance to the nation's future competitiveness. One such area is Data Sciences.

4.2 Market and Workforce Demands

The SAS Institute has identified data analytics as one of the key business activities and skill demands of the 21st century. They provide practical business-focused training in the field of business intelligence (data mining for business) and statistics. Though mostly business-oriented, their message is clear: students need hands-on experience with data analytics in order to stand out and have value in an increasingly competitive global marketplace.
They claim that data analytics is causing a change not only in the way organizations do business, but also in the skills those organizations are seeking in prospective employees. So, what is data analytics? It is the discipline of organizing data and information, mining data for trends and insights, and enabling data-driven decisions. Data analytics specialists are trained to collect, store, extract, cleanse, transform, aggregate, mine, and analyze data, and most importantly, to convert that analysis into value-added products and action through value creation. These skills represent the key components of the educational program within our Data Sciences undergraduate curriculum at Mason. In CXOToday, a news source for Chief Information Officers, an editorial boldly proclaimed "Data Analytics: The Time Is Now!" (Dec. 29, 2005). They assert that the availability of a data-skilled workforce is now a significant success factor for businesses. The same editorial estimates that the global market for data analytics in 2007 alone is approximately $17B. Other estimates project strong growth in this sector in the years ahead. For example, the Gartner Research Group (a highly regarded think tank for business prognostications) says that businesses are so determined to gain more meaningful insights from their growing data volumes that they have invested more than $40 billion over the past few years into projects that are aimed at mining and gaining knowledge from their reams of operational data, and the projections for the future are at least as strong. In another instance, the global director of ATG Worldwide (a leading international consulting firm in the field of e-commerce) has stated, "This is the golden age of Data Analytics. There is no lack of data, however there is a serious dearth of intelligent interpretation of data." 
Our proposed data sciences curriculum at GMU will take direct aim at that problem, especially in the sciences, and will train a workforce for tomorrow that is ready and able to fulfill the promise of data mining: extracting information from data, discovering knowledge from information, and converting knowledge into understanding and action. Scientific knowledge discovery will be enhanced as a result.

The need for Data Analytics is particularly relevant for the northern Virginia region, which is home to one of the fastest-growing high-technology sectors in the nation. Although we have no doubt this work will have broader impact, Mason's location in this high-tech area makes it extremely well suited for this initial project. Many employers in the region are struggling to fill vacancies for college graduates who can combine computational methodologies with scientific and mathematical skills as members of interdisciplinary science teams. These employers include Mitre, SAIC, CSC, Hughes, Boeing, AOL, NASA, NSF, NIH, NRL, NIST, NOAA, and many others. In addition to employers, graduate schools seek to recruit students who understand the new ways of doing science. Graduates with the BS degree in Computational and Data Sciences will be qualified to work in private industry and also in government laboratories and bureaus in fields such as computational statistics, mathematics, physics, astronomy, biology, climate dynamics, and Earth observing/remote sensing.

The Bureau of Labor Statistics Occupational Outlook Handbook provides some useful insights into the expected demand for alumni of computational science undergraduate programs. It states that "Employment of computing professionals is expected to increase much faster than average (increase by 36% or more between 1998 and 2008) as technology becomes more sophisticated and organizations continue to adopt and integrate these technologies." Additional insight is provided by the comments of Bruce P.
Mehlman, Assistant Secretary for Technology Policy, U.S. Department of Commerce, in testimony before the U.S. House of Representatives Subcommittee on Environment, Technology, and Standards on June 24, 2002. In his testimony, Dr. Mehlman states that "There has been concern about ensuring that we have a world-class science and engineering workforce for our knowledge-based economy," and furthermore, "Approximately 86 percent of the increase in science and engineering jobs (during 2000-2010) will likely occur in computer-related occupations." There is an enormous number of new positions to fill since, as Dr. Mehlman points out, "The ten-year occupational employment projections prepared by the U.S. Department of Labor's Bureau of Labor Statistics (BLS) indicate that, between 2000 and 2010, 2.5 million new IT workers will be needed to fill new IT jobs and to replace workers leaving the profession." The Society for Industrial and Applied Mathematics Working Group on Computational Science and Engineering (CSE) Education states that "Research in CSE involves the development of state of the art computer science, mathematical and computational tools directed at the effective solution of real-world problems from science and engineering," and furthermore, "We believe that CSE will play an important if not dominating role for the future of the scientific discovery process and engineering design." There is a strong feeling that the current climate is highly favorable toward interdisciplinary work in science and engineering. Data Sciences will play a complementary and essential role in the sciences, along with simulation, in future discovery.

4.3 Why Mason?

We believe that George Mason University is naturally positioned to lead the development of an undergraduate data science curriculum for the following reasons.
• We have extensive experience with graduate education in Computational Science. There have been over one hundred graduates with an MS or PhD in Computational Science from Mason.
• The new undergraduate degree in Computational and Data Sciences has just been approved by the State Council of Higher Education (the day before this proposal was due!). All the courses in the degree have been approved, but have not yet been taught.
• Mason is also located geographically in a region where the demands for a data-skilled workforce are intense and growing. Mason is just outside of the nation's capital, is surrounded by federal agencies that produce, assemble, and analyze vast collections of data (e.g., NIH, DHS, FBI, NSA, NASA, NOAA, FDA, DOE, HHS), and is embedded among many major corporations that support those agencies and/or generate their own vast data collections (e.g., major oil companies, banking institutions, news agencies, and data service providers). We envision opportunities to develop "data science internships" for our students in some of these places.
• Mason has been consistently rated one of the "most diverse universities" according to U.S. News & World Report: America's Best Colleges.
• Our first course (CDS 101 Introduction to Computational and Data Sciences) is planned to satisfy a university-wide General Education requirement in the Natural Sciences. This course will reach into the science education programs, thereby promoting teacher professional development and broadening their participation in science and engineering.
• Mason has several strong research programs, centers, and departments that are focused on data-intensive science and in which the students in our Data Sciences Program can gain practical experience through internships and other involvement. These include groups in Space Weather, Bioinformatics, Earth Systems and Geoinformation Sciences, the Joint Center for Intelligent Spatial Computing, the Geographic Information Center of Excellence, Computational Statistics, Mathematical Sciences, Data Mining, Computational Social Sciences, and Computational Economics (including 2002 Nobel laureate Dr.
Vernon Smith), and more. These groups are often looking for students at all levels who are trained in the skills that our program offers.

The faculty team in this program has extensive research and teaching experience in Data Sciences:
• Kirk Borne: program manager for the Space Sciences Data Operations Office contract activity at NASA's Goddard Space Flight Center; currently the lead for community data access for the Large Synoptic Survey Telescope project; among the senior science personnel on NSF's data-intensive National Virtual Observatory program; a participant in NSF DLESE.org data-in-education workshops. Borne has taught a master's-level data mining course at UMUC for several years and the graduate scientific databases course at GMU since 2003. He is also a founding contributor to the blog Data in Education [http://dataineducation.blogspot.com], where critical issues related to the use and efficacy of data in education are discussed.
• Dan Carr: has taught scientific and statistical visualization to nearly 400 Ph.D. students at Mason; has extensive experience in multidimensional data exploration, visualization, and visual analytics. Carr has conducted NSF-funded digital government research, collaborated with several researchers in different federal statistical agencies, and was one of the lead developers of NCI's State Cancer Profiles web site, which is used in visually communicating statistics to health planners across the nation. Carr was on the expert panel developing the five-year Visual Analytic R&D program for the Department of Homeland Security.
• James Gentle: author of several books on Computational Statistics; extensive experience with many of the graduate-level courses in Computational Science at Mason.
• John Wallin: extensive experience teaching courses in computational science, simulations, and high performance computing; chair of the undergraduate program committee in Computational Science; recipient of the Outstanding Teaching Award at Mason.
• Robert Weigel: PI of the newly formed Virtual Radiation Belt Observatory project [http://virbo.org] and one of the lead developers of a suite of visualization software for magnetospheric data [http://www.bu.edu/cism/cismdx] that has been used in research and education. Weigel has also developed short courses for graduate, undergraduate, and military students on the use of visualization of heliospheric data and is currently teaching a graduate course on Statistical Methods in the Space Sciences.

We believe that George Mason University is naturally positioned to lead the development and deployment of an undergraduate Data Sciences curriculum. In addition to local resources at Mason (e.g., a world class data science faculty, a strong history and infrastructure in computational sciences and informatics, clear commitment from the university to proceed in this direction, and existing participation in several data-intensive research programs), the location of Mason, its ties to Federal laboratories and the high-tech industry, and the associated workforce needs make us ideally positioned for this project.

5.0 Project Description

In this project, we will develop four new undergraduate courses. In section 1, we discussed the challenges associated with creating these courses. In this section, we describe the pedagogical approach we will use in the project, taking two courses as examples: Introduction to Computational and Data Sciences, and Scientific Data Mining. The work inside the classroom will be highly interactive, so students can learn by doing.
Concept testing via the personal response systems will be used throughout the lectures, along with in-class group assignments. Outside the classroom, the readings will be matched with on-line quizzes that can be re-taken to help correct misconceptions and reinforce new ideas. The homework projects will be structured with examples and augmented with peer support groups to help students overcome the technical difficulties of using new software. Some office hours will be held on-line to help students when they are working on their assignments. All of the courses will use interdisciplinary, real-world examples to overcome shortfalls seen in some of the material used in scientific computing courses (Murphy et al. 2005).

5.1 Introduction to Computational and Data Sciences

This course provides an interdisciplinary introduction to the tools, techniques, methods, and cutting edge results from across the Computational and Data Sciences. Students will be shown how computational tools are fundamentally changing our approach in the experimental, observational, and theoretical sciences through the use of data and modeling systems. No mathematical background is assumed, other than high school algebra. Qualitative results will be emphasized, to show the problems, algorithms, and challenges facing researchers today. Examples will be drawn from both the real world familiar to students and also from the frontiers of science where these techniques are being used to solve complex problems. Upon completion of the course, students should be able to:
1. describe how data is represented within a computer, from binary numbers to arrays and databases
2. explain how scientific data is acquired, processed, stored, reduced, and analyzed using computers
3. express how we create knowledge from data and information using visualization and data mining
4. effectively use simple data analysis and data mining software
5. create effective ways to visualize simple data sets
6. conduct and explain simple simulations of complex phenomena
7. express how changing technologies in computing allow us to further scientific research, and how technological and scientific progress are tied together

The modules created as part of this project will be in several forms. Some will be short lectures with concept tests. Others will be reading material with review questions or group projects that can be done outside of class. The emphasis will be to package the material in small units that can be extended, rearranged, or transferred to other educational settings. The topics that will be covered in the Introduction to Computational and Data Sciences class are:
• The Scientific Method - Experiments, Observations, and Models
• Computer Internals - binary numbers and logic circuits
• Computer Algorithms and Tools - introduction to programming in Matlab
• Data acquisition - linking sensors to computers
• Signal Processing - understanding sources of noise and errors
• Scientific Databases - storing and organizing scientific data
• Data Reduction and Analysis - moving from data to information
• Data Mining - moving from information to knowledge
• Computer Models - using mathematics and algorithms to represent reality
• Computer Simulations - solving linear systems
• Computer Simulations - applications
• Computer Visualization - seeing experiments as images
• High Performance Computing - simulation at the cutting edge of technology
• Future directions in computational science - languages, quantum computing and beyond

As we have discussed in section 3, there are excellent curricular
materials available in Computational Science that are appropriate for topics such as Computer Models, Simulations, and High Performance Computing. For this course and under this proposal, our focus is to develop curricular material for Data acquisition, Signal Processing, Scientific Databases, Data Reduction & Analysis, Data Mining, and Computer Visualization. For each of these topics, we will create course material appropriate for both inside and outside the classroom. The material will include:
• Readings for students, taken from existing sources where possible - to help introduce new ideas and define concepts
• Reading comprehension quizzes through WebCT associated with the readings - to let the student confront misconceptions in a safe environment where multiple attempts are possible
• Short lecture segments in the classroom - to introduce ideas, tools, and case studies in Data Sciences and set the stage for interaction and group projects
• Concept tests for use in the lecture segments using the Personal Response System - to let the student answer questions, and then confront their misconceptions within peer discussions
• In-class small group exercises - to focus student learning and encourage interaction with the class materials
• Out-of-class group homework assignments that have students use data science tools on sample projects - to have students use real world examples in an environment with scaffolding examples and peer support through groups and through WebCT

Each of these elements is designed to build on the students' experiences and help them confront misconceptions so they can learn effectively. To describe the environment students will be in during the class, it is helpful to use a simple scenario from a typical class session.

Marila came to her CDS 101 class about 10 minutes before the start of the period. She and her friends started talking about the project that was assigned for the next week.
When the professor came in, he began a ten minute lecture on how data is added to databases. The reading she finished last night had talked about how data is stored in computers, so it seemed like the ideas he talked about weren't all that complicated. The WebCT quiz she took last night after she completed the reading asked her to define arrays and array elements, so she already knew the terms the professor was using. After about 10 minutes, Marila and the rest of the class took out their Personal Response System clickers so they could answer the teacher's questions about storing data. She and the class first answered the questions on their own, and then were asked by the teacher to discuss the question in small groups and answer it again. After talking with her friends, she convinced them that she had the correct answer. The professor went through a couple more of these short lessons with questions during the hour, and showed the class an example of how a parallel database works. About halfway through the hour, the professor asked the class to break into groups to simulate a search on a distributed data system. All the students wrote down ten numbers between one and one thousand. The professor then asked the groups to find a quick way to find the closest match to some selected target numbers he put on the class screen. Marila and her group discussed this for a few minutes. It took a lot of thinking to figure out how to do this easily. After about ten minutes, the professor asked the groups to report their findings, and then discussed the results. The project they had due for the next week was to enter a data set they had been given into a web-based system, and then make some simple queries about it. This week, the data was taken from weather records from around the country. Although all the people in her group had to do similar things, every student was assigned to use a different city for their project.
Doing this seemed pretty difficult at first, but the example from class, the step-by-step example in the homework, and the help Marila got by talking to people on the WebCT discussion board helped her figure out how to make the software work. The professor had encouraged the students to work together on the assignments, as long as they did their own analysis and wrote up their own conclusions for the project. On one of the assignments, she really got stuck and had to ask the professor some questions during his on-line office hours. It was strange, but Marila had never liked working with numbers before. With the people in her group and the computers, it seemed a lot easier.

5.2 Scientific Data Mining

This course provides a broad overview of the data mining component of the knowledge discovery process, as applied to scientific research. As scientific databases have grown at near-exponential rates, so has the difficulty in analyzing these large databases. Data mining is the search for hidden meaningful patterns in such databases (e.g., find the one gene sequence in a large genome DNA database that always associates with a specific cancer). These patterns and relationships are often expressed as rules (e.g., if a blue star-like object is found next to a faint unusual-shaped galaxy in a large astronomy database, then the blue object might be a distant quasar whose outburst is being triggered by a collision with that galaxy; or if a patient takes both Drug A and Drug B, then N% of the time they will develop side effect X). Consequently, data mining is sometimes referred to as the process of converting information from a database format into a knowledge-based rule format. Identifying these patterns and rules from enormous data repositories can provide significant competitive advantage to scientific research projects and in other career settings.
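As a rough illustration of the rule format described above, the sketch below counts how often a rule of the form "if Drug A and Drug B, then side effect X" holds in a small set of records. The patient records, the item names, and the helper rule_stats are hypothetical and were invented purely for illustration. "Support" and "confidence" are the names commonly used for these two fractions in association analysis; confidence plays the role of the N% in the example.

```python
def rule_stats(records, antecedent, consequent):
    """Return (support, confidence) for the rule: antecedent -> consequent.

    records    : iterable of sets, one set of observed items per record
    antecedent : set of items in the "if" part of the rule
    consequent : set of items in the "then" part of the rule
    """
    records = list(records)
    # Records that satisfy the "if" part of the rule.
    matches_if = [r for r in records if antecedent <= r]
    # Of those, records that also satisfy the "then" part.
    matches_both = [r for r in matches_if if consequent <= r]

    support = len(matches_both) / len(records) if records else 0.0
    confidence = len(matches_both) / len(matches_if) if matches_if else 0.0
    return support, confidence


# Hypothetical toy data: each set lists what was recorded for one patient.
patients = [
    {"drug_a", "drug_b", "side_effect_x"},
    {"drug_a", "drug_b", "side_effect_x"},
    {"drug_a", "drug_b"},
    {"drug_a"},
    {"drug_b", "side_effect_x"},
]

support, confidence = rule_stats(
    patients, {"drug_a", "drug_b"}, {"side_effect_x"}
)
print(f"support={support:.2f}, confidence={confidence:.2f}")
```

With these toy records, three patients took both drugs and two of them developed the side effect, so the rule holds about 67% of the time among matching records. Real scientific databases would require far more careful handling of sample size and confounding factors than this sketch suggests.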
Data mining will be motivated and analyzed in this course as the "killer app" for large scientific databases (i.e., we collect these data into databases in order to facilitate scientific discovery). Data mining techniques, algorithms, and applications will be covered, as well as the key concepts of machine learning, data types, data preparation, previewing, noise handling, feature selection, normalization, data transformation, similarity measures, and distance metrics. Algorithms and techniques will be analyzed specifically in terms of their application to solving particular problems. Several scientific case studies will be drawn from the science research literature, potentially including astronomy, space weather, geosciences, climatology, bioinformatics, numerical simulation research, drug discovery, health informatics, combinatorial chemistry, digital libraries, and virtual observatories. The techniques that are presented will include well known statistical, machine learning, visualization, and database algorithms, including outlier detection, clustering, decision trees, regression, Bayes' theorem, nearest neighbor, neural networks, and genetic algorithms. Prerequisites for this course will include the undergraduate Scientific Data and Databases course that is part of our proposed Data Sciences program and also intermediate-level collegiate mathematics/statistics courses. Upon completion of this course, the student should be able to:
1. Express the role of data mining within scientific knowledge discovery.
2. Express the most well known data mining algorithms and correctly use data mining terminology.
3. Express the application of statistics, similarity measures, and indexing to data mining tasks.
4. Determine appropriate techniques for classification and clustering applications.
5. Determine approaches used for mining large scientific databases (e.g., genomics, virtual observatories).
6. Recognize techniques used for spatial and temporal data mining applications.
7. Devise an effective scientific data mining case study and design a scientific data mining application.
8. Express the steps in a data mining project (e.g., cleaning, transforming, indexing, mining, analysis).
9. Analyze classic examples of data mining and their techniques.
10. Effectively prepare data for mining and use data mining software packages.

Lecture topics in the Scientific Data Mining course will include:
• Data Mining Roots and Concepts
• Scientific Motivation for data mining
• Background Methods (e.g., databases, statistics, visualization, rules-based decision systems)
• Software packages for data mining
• Data Preparation (e.g., data types, previewing, dirty data, normalization, transformation)
• Distance and Similarity Metrics for Clustering and Classification
• Supervised Learning Methods (e.g., decision trees, neural networks, Bayes, Markov models)
• Unsupervised Learning Methods (e.g., clustering, nearest neighbor, link/association analysis)
• Scientific Data Mining Case Studies
• Special Topic Data Mining (e.g., text, images, spatio-temporal data, high-performance)
• Next-Generation Mining (e.g., multi-media, semantic mining, ontologies, knowledge mining)

Co-investigator K. Borne has taught a non-scientific data mining graduate course at UMUC for several years and will adapt much of the material from that course to the scientific context for this new undergraduate course at Mason. This will require substantial re-working of the lecture material, the homework and computer lab assignments, the in-class examples, the group project activities, and the exams. Students will be engaged with the course material through these labs, homework assignments, projects, and exams, with ongoing formative assessment. Experience with the UMUC graduate course has illuminated key problem areas for students learning this material.
These lessons learned and corresponding best practices will be applied in such a way that the skills and understanding level of the students will grow deeper as the course develops. For example, it has been found to be beneficial to learning and retention for the concept of neural networks to be presented several times during the course of the semester (e.g., in the lectures on Data Mining Concepts, Background Methods, Classification, Supervised Learning Methods, and finally Data Mining Case Studies; in the last of these lectures, a real-world published example is presented from K. Borne's own research, in which a neural network was found that can use remote sensing satellite data of the Earth to predict locations of grass and woodland wildfires). Group projects and lab assignments are pursued collaboratively, which naturally includes active feedback and peer review. Instructor intervention is always available during in-class project/lab time, or during office hours. We have educational training in on-line distance learning, where the students engage in mutual dialogue and problem-solving in an on-line discussion environment. This provides a mostly pressure-free and open-ended context for discovery and insights into new approaches. A novelty to be introduced into the course will be the use of personal response systems for immediate concept testing and learning assessment. The introduction of this highly interactive course component will require substantial new work, especially to produce useful questions in meaningful contexts. All of the above pedagogical approaches will enable us to confront misconceptions as soon as possible. We provide here a simple sample scenario:

Marsala comes to the lecture on Special Topic Data Mining with a fairly good idea about what clustering is and how it can be used to segment databases into multiple groupings. He is now confronted with the concept of spatial indexing and how that enables mining and clustering in spatial databases.
The instructor outlines the technique, the concepts, and the various spatial indexing schemes that can be used in large databases to speed up knowledge discovery. Then the students are paired off to develop a quick user scenario of spatial data mining, which they will present to the class at the next lecture. Marsala's teammate is Lucinda, who has had a knack for knowing the right answer to everything this semester (she took data mining from the computer science department last semester). But now Marsala is in the driver's seat, because he worked for the phone company last summer to make some cash. He distributed phone books all across Northern Virginia, and he wished that his workload were more compactly distributed geographically so that he would not have to drive so much. Marsala explains the spatial data clustering problem that he faced, applying it to a scientific scenario (e.g., finding groups of counties in Virginia that have had invasive species infestations over the past 5 years, and spatially correlating those events with water quality problems reported annually by counties to the State Legislature). Lucinda did not get it at first, but with Marsala's insightful analogy to clustering marbles by color, applying the same idea to clustering objects by spatial location now makes perfect sense (using simple numeric quadtree indexing as the equivalent of a Dewey decimal system for latitude/longitude indexing, as a surrogate for the arbitrary human county names). Marsala and Lucinda exchange their ideas and analogies on-line, and they develop a show-stopper class presentation. They earn high marks from their peers and their instructor.

6.0 Evaluation

To evaluate this project, we have established seven outcomes we hope to achieve with our curriculum:

Outcomes: Conceptual understanding
1) Improve students' conceptual understanding of the role of Data Sciences within the scientific method, as measured by pre- and post-course tests.
Outcomes: Affective changes
2) Improve students' attitudes about using computers for scientific analysis, as measured by a structured evaluation tool.
3) Improve students' confidence in using computers to solve problems with scientific data, as measured by a structured interview.

Outcomes: Cognitive changes
4) Improve students' abilities to use and create scientific visualizations, as measured by their ability to form and test hypotheses using scientific data they have examined visually.
5) Improve the ability of students to use pre-existing data sources for analysis, as measured by their ability to extract and correlate data within and between scientific data sets.
6) Improve the ability of students to gather scientific data using remote sensors, as measured by their ability to complete homework assignments and projects.
7) Improve the ability of students to use high-level languages to reduce raw data from temporal and imaging sensors, as measured through on-line and at-home exercises.

The evaluation of these outcomes will be done in conjunction with an external evaluator. For this project, Dr. Laurie Fathe will serve as the evaluator. Dr. Fathe was the Associate Provost for Educational Improvement and the head of Mason's Center for Teaching Excellence for the last six years. She is also the former director of the Los Angeles Collaborative for Teacher Excellence, a large NSF-funded project to improve science teaching among college faculty as well as to revise K-12 teacher preparation programs. With her consultation, we will design and implement appropriate tools for measuring our success. The evaluator's involvement will begin in January 2008, before the first class is taught. She will help create pre-tests and guide us in the development of formative evaluations and structured interviews. Finally, she will help create summative evaluations for the classes and review progress based on changes between the pre-tests and the final exams.

7.0 Dissemination and Impact

The results from this project will be presented at professional conferences and in seminars at selected schools. We feel this work will be of particular interest to those colleagues who have already established bachelor's degrees in Computational Science, particularly those who are participating in the National Computational Sciences Institute. By going directly to these schools, we will provide information on the effectiveness of our work, curricular materials, and inspiration to develop this material further. We also hope to learn from their experiences and use them to enhance our curriculum.

The final results will be published in journals associated with Data Sciences and computational science education. Computing in Science and Engineering, Computational Science & Discovery, and Data Sciences are the likely journals for the final results of this project. All of the curricular materials and scenarios will be placed within the National Science Digital Library, associated with the pages on Using Data in the Classroom. The research questions addressed within this proposal fall directly in line with the goals of this section of the Library.

Because this is a new initiative, the initial impact of this proposal is modest. We anticipate having about 20 majors enrolled in our courses after the first year, with approximately 80 students taking courses in a minor in Computational and Data Sciences that we hope to offer starting in 2008. Initially, we expect approximately 20 students in the Introduction to Computational and Data Sciences class, growing to approximately 100 students in the class within five years.

8.0 Broader Impact

The broader impact of this proposal will be seen in four key ways. First, by creating curricular materials, this project lowers the barriers to adoption of a Data Sciences curriculum by other universities.
This area is an emerging academic discipline, and few curricular materials are available for new programs. Second, it explores the pedagogical effectiveness of such a curriculum in terms of conceptual understanding as well as cognitive and affective changes. Third, this project will create college graduates who are ready to meet the national, regional, and local need to respond to the upcoming flood of data. The creation of this workforce is critical to economic growth and international competitiveness. Finally, this program will create faculty expertise in teaching Data Sciences to undergraduate students, which will be shared with other universities through formal and informal presentations and contacts.

9.0 Project Management and Timeline

The work done under this project will be managed by the PI, Wallin. He will be responsible for managing the team and for overseeing the curricular development, the creation of the assessment tools, and their implementation in the classroom. He will also lead the effort to publish and disseminate the materials and results of this project to a broader audience. Travel to consult with other undergraduate programs in Computational Science is requested in year one. Travel to a national conference to present the final results is requested in year two. Two months of summer salary, split across the two-year project, is requested for the PI, along with a single course release. The evaluator for the project, Dr. Laurie Fathe, will lead the development of the metrics associated with this project. The Co-PIs and the PI will all be involved in developing materials and teaching the courses within this new program.

The planned schedule for the project is:

• Fall 2007: Pre-award development of the initial curriculum for the Introduction to Computational and Data Sciences course.
• Spring 2008: Project begins; first meetings with the project evaluator; Introduction to Computational and Data Sciences is offered for the first time.
• Summer 2008: Assessment of Introduction to Computational and Data Sciences continues, along with development and improvement of the curricular materials based on student feedback. Talks with other universities begin, discussing possible collaboration, future adoption of, and additions to this curriculum.
• Fall 2008: Scientific Visualization and Scientific Databases offered for the first time; initial presentation of the curriculum at a national conference.
• Spring 2009: Scientific Data Mining offered for the first time; Introduction to Computational and Data Sciences offered for the second time.
• Summer 2009: Project ends; results written up for publication; course materials placed on-line and linked into NSDL. Discussions with other universities continue.