Big data now playing ..... at the sandbox
John.Dunne@cso.ie
17th October 2014
IAOS, Vietnam

Overview
• Context
• How CSO got interested in big data
• The sandbox
• Learning from other industries
• Learning from the past
• The sandbox - looking to the future
• Concluding comments
Keywords - big data, modernisation, sandbox

Big data - working definition
Data that is difficult to collect, store or process within the conventional systems of statistical organizations. Either their volume, velocity, structure or variety requires the adoption of new statistical software processing techniques and/or IT infrastructure to enable cost-effective insights to be made.

Do more with less
Mindset - Opportunities exist with secondary data sources

Legal environment
Key: 3 legislative pillars
• Data Protection
• Freedom of Information
• Official Statistics

Modernisation and big data
2011 Conference of European Statisticians endorses modernisation strategy
2012 Big data on modernisation agenda
2013 ESSC Scheveningen memorandum on Big data and official statistics
2013 International Big data team gets going
2014 Big data on UNSC agenda
2014 The sandbox goes live at MSIS Dublin

2013 CSO Project - To determine household composition using smart metering data
Origin of data: Consumer Behaviour Trials in 2009 and 2010
• Over 5000 households in pilot
• 3 months baseline data (reading every 30 mins)
• Pre-trial survey using CATI
http://www.unece.org/stats/documents/2013.09.coll.html

Project with pilot data brought challenges
• Pilot: 7 million data points per month
• Go live: 2160 million data points per month
• "Joe, we need a bigger computer"
• ICHEC helped out
https://www.ichec.ie/

The sandbox
The hardware on which the sandbox system is based is a High Performance Computing cluster called Stoney.
The cluster has been hosted at the National University of Ireland, Galway since April 2009 and is composed of 60 compute nodes, each of which has two 2.8GHz Intel (Nehalem EP) Xeon X5560 quad-core processors, 48GB of RAM and a 1TB local disk. Each node is connected to two networks: an InfiniBand network for accessing the shared Lustre filesystem and for high performance communications, and a Gigabit Ethernet network for management tasks. In addition, a 20TB shared filesystem is available to all nodes. ICHEC will dedicate 20 compute nodes to enable a Hadoop cluster with 160 cores, almost 1TB of RAM and 20TB of HDFS distributed storage.

The sandbox provides an environment to
o test feasibility of remote access and processing
o test whether existing standards/models/methods can be applied to big data
o evaluate the usefulness of big data software tools
o learn by doing with respect to potential uses, advantages and disadvantages of big data
o facilitate further collaboration in the international community

The toys (data sources)
o twitter data
o mobile phone data
o satellite imagery / aerial photography
o price data / job vacancy data via scraping
o scanner data / price data sourced via large vendors
o data from road traffic sensors
o smart meter data on electricity/gas consumption

Some of the players
To play, contact Steven.Vale@unece.org

Learning from other industries - technical partners can have a role to play
Exchange of data for billing purposes (diagram labels): Irish Mobile Network Operators (MNOs), Data Clearing Houses, ROW Mobile Network Operators

Learning from the past - think about the bigger picture
Nordbotten, Thygesen and the statistical archive concept

Learning from the past - do not underestimate privacy concerns
http://www.census.gov/history/pdf/kraus-natdatacenter.pdf
http://blog.modernmechanix.com/the-national-data-center-and-personal-privacy/
"The National Data Center and Personal Privacy" by Arthur R Miller

The sandbox - looking to the future
o Centres for Research and Development ?
o Centres of Excellence ?
o Partner organisations for collecting, processing or storing data of a less sensitive or non-sensitive nature ???
o Significant partner organisations enabling the collection, processing or storage of data of a sensitive nature ?????

Concluding remarks
• Think about bigger picture / broader system
• An open mind to the possibility of new partners
• Be open and transparent
• Don’t underestimate privacy concerns
• Continue to collaborate and share
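
Back-of-envelope check of the figures quoted for the smart metering project and the sandbox hardware, as a minimal Python sketch. The function names, the 30-day month, and the reading of "go live" as the same half-hourly frequency applied to more meters are assumptions made for illustration; they are not stated in the presentation.

```python
# Rough arithmetic behind the quoted data volumes and cluster capacity.
# Assumptions (not from the slides): a 30-day month, and that "go live"
# keeps the pilot's 30-minute reading interval.

READINGS_PER_DAY = 24 * 60 // 30   # one reading every 30 minutes -> 48
DAYS_PER_MONTH = 30                # assumption


def monthly_data_points(meters: int) -> int:
    """Readings generated per month by a given number of meters."""
    return meters * READINGS_PER_DAY * DAYS_PER_MONTH


def implied_meters(points_per_month: int) -> float:
    """Number of meters that would produce the given monthly volume."""
    return points_per_month / (READINGS_PER_DAY * DAYS_PER_MONTH)


def hadoop_capacity(nodes: int,
                    sockets_per_node: int = 2,
                    cores_per_socket: int = 4,
                    ram_gb_per_node: int = 48,
                    disk_tb_per_node: int = 1) -> dict:
    """Aggregate cores, RAM and local disk for a set of Stoney-style nodes."""
    return {
        "cores": nodes * sockets_per_node * cores_per_socket,
        "ram_gb": nodes * ram_gb_per_node,
        "disk_tb": nodes * disk_tb_per_node,
    }


if __name__ == "__main__":
    print("Pilot, 5000 households:", monthly_data_points(5000))
    print("Meters implied by 2160 million points:", implied_meters(2_160_000_000))
    print("20-node Hadoop cluster:", hadoop_capacity(20))
```

Under these assumptions, 5000 households give 7.2 million readings per month, in line with the 7 million quoted for the pilot, and 2160 million readings per month corresponds to about 1.5 million meters. Twenty nodes give 160 cores, 960GB of RAM and 20TB of local disk, consistent with the Hadoop cluster figures quoted.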