Data science

Data science fuses mathematics, statistics, and computer science, using aspects of all three disciplines to gain insight from data sources of all kinds.

Data science incorporates the analysis of large and complex datasets, machine learning, artificial intelligence, and a host of new algorithms that can reveal structure in data, perform classifications, and make predictions.

Researchers in the Data Science Group develop new algorithms and insights in a variety of applications. Their research includes ecology, geophysics, the analysis of social networks, classification, and clustering.

Researchers

Professor of Statistics and Data Science
School of Mathematics and Statistics

Professor of Mathematics
School of Mathematics and Statistics

Professor in Statistics and Data Science
School of Mathematics and Statistics

Associate Professor
School of Mathematics and Statistics

Senior Lecturer in Data Science
School of Mathematics and Statistics

Senior Lecturer in Statistics and Data Science
School of Mathematics and Statistics

Lecturer in Computational Mathematics and Statistics
School of Mathematics and Statistics

Current projects

AviaNZ: Making Sure New Zealand Birds Are Heard (Marsden Fund: Prof Stephen Marsland)

The old mantra that you can’t improve what you can’t measure poses particularly interesting challenges in conservation and ecology, and those challenges are all about data. Many of the species that we wish to know more about are difficult to spot, increasingly rare, and often in remote places. However, many of them call. Call count surveys have exploited this to enable some estimates of population density to be made. In recent years, acoustic recorders have become widespread. These simple machines can be left in the bush to record onto an SD card at pre-specified times for weeks or months. The result is a large amount of data about the soundscape of the bush areas of our country.

Unfortunately, while recording the soundscape is easy, analysing the recordings is not. The AviaNZ project is attempting to fill this gap. We are interested in every part of the problem of estimating population abundance from unobtrusive measurements in the forest (principally sound, but potentially also photographs and videos): protocols for collecting the measurements, mathematical and signal processing methods for automatic detection and recognition, easy-to-use and freely available software, and statistical methods for estimating abundance reliably.

Several challenges make this a hard problem. The birds are at varying distances from the microphone, from a few metres to hundreds. There are many other noises: wind, rain, rivers, aeroplanes and other human sounds, and other animals, to name just a few. And the range of noises that the birds themselves make can vary greatly, even within a species.

A data-science driven evolution of aquaculture for building the blue economy (Prof Richard Arnold, AProf Ivy Liu and Dr Binh Nguyen)

Focusing on the farming of Greenshell™ mussels and finfish, the aim of this project is to develop innovative data science, artificial intelligence (AI), and machine learning techniques that will enable the aquaculture industry to keep growing efficiently and at large scale, producing high-quality, low-carbon protein for New Zealand and the world without compromising the environment.

Researchers from the School of Engineering and Computer Science (Mengjie Zhang, Bing Xue, Yi Mei, Harith Al-Sahaf) and from the School of Mathematics and Statistics (Ivy Liu, Richard Arnold, and Binh Nguyen) are working with researchers from Plant & Food Research, the Cawthron Institute, and the University of Otago on the project, which is funded by the Ministry of Business, Innovation and Employment and has been awarded $13 million.

The research team is developing innovative, evolutionary, and statistical learning techniques for use in the aquaculture industry by collecting and incorporating various types of data. Farm managers will be able to use the data to drive decision-making when responding to climate challenges, managing disease, improving production yields, and farming sustainably at scale.

These learning techniques will also help create better AI, which can be used to expand the capacity of the mussel and finfish farms.

A significant focus for the programme is building Māori capacity in data science. Māori own significant aquaculture assets but are under-represented in the field of data science. This project aims to bring together data science and Māori communities with aquaculture interests and will help produce the next generation of Māori graduates capable of leading the technology development needed to scale up the industry.

A total of 12 PhDs, 16 Master’s and 35 Honours students will be involved in the project along with 5 postdoctoral fellows and 35 summer research projects. This will grow New Zealand’s capacity in data science by embedding academically trained, early-career scientists across a range of organisations linked to the aquaculture sector.

Dimension reduction for mixed type multivariate data (Marsden Fund: Prof Richard Arnold and AProf Ivy Liu)

Multivariate data analysis methods are typically restricted by the assumption that the data are all of the same type. However, many data sources contain data of mixed type: for example, in a health survey, data may be binary (family history of cancer, yes/no), nominal (ethnicity), ordinal (self-rated health, from poor to excellent), count (number of times under anaesthesia), or continuous (weight). Other examples of such data include ecological data on species and sites, and the large and complex data collections that are increasingly common in biology (especially genetics), commerce, and computer science. We may wish to find groups of respondents, each containing individuals who are similar in their patterns of response, and groups of questions that have correlated responses.

Where questions are found to be correlated, their redundancy can be exploited and a reduced set of questions can be used in analyses. This dimension reduction is particularly important in large datasets (thousands of variables), where a full analysis is computationally infeasible.

In this project we will develop new methods for finding correlation structures within potentially large mixed type datasets. We will use finite mixtures to detect groups of similar individuals, and latent structures to identify correlated variables, thus enabling dimension reduction.

Phenotypic drug discovery using deep learning (University Research Fund: Dr Binh Nguyen)

This project aims to develop a novel deep-learning framework to analyse more than 400 FDA-approved compounds in order to identify drugs that can be repurposed for the treatment of specific diseases.

De novo drug discovery for type 2 diabetes mellitus treatment using deep-learned generative models (Science for Technological Innovation (SfTI 2019 Seed Projects), Callaghan Innovation)

The project aims to use deep-learning approaches to generate new medications for the treatment of type 2 diabetes. These new drugs will be designed by combining the structures of currently known medications and herbal medicines.

Statistical Information Theory and Geometry for Networks, Signal, and Image Processing and Analysis (Brazilian National Council for Scientific and Technological Development (CNPq) Fund: Prof Alejandro Frery)

This research project aims to advance the state of the art in the use of techniques derived from Information Theory and Information Geometry, combined with statistical modelling of data. We study complex networks, signals, and images. Among the images, the main focus is on processing and extracting information from data generated by Synthetic Aperture Radar (SAR) sensors, in particular Polarimetric SAR images. The signals are treated with non-parametric techniques. We study descriptors of complex networks based on Information Theory.

Other projects

Dr Louise McMillan

I am researching model-based clustering methods for mixed data types, particularly categorical data. There are few existing methods for clustering mixed data that includes categorical data, and many of those methods rely on restrictive assumptions that are not appropriate in certain real-world scenarios. I am developing more flexible methods that allow for strong correlations between variables, and also strong correlations between individuals in the same cluster. I work with a variety of datasets, from fisheries data to survey data to ecological data, but I am particularly focused on methods that could be applied to population genetics for conservation management.

Completed projects

Distance and Direction Estimation for Acoustic Bird Monitoring (2017 NSC Science for Technological Innovation Seed Projects, Prof Stephen Marsland).

Postgraduate opportunities

There are a variety of scholarships available for students studying at the Wellington Faculty of Science.

From 2021 we are offering a new Master of Data Science. Until then, students interested in postgraduate study can select taught courses from existing programmes in Mathematics, Statistics, and Computer Science.

Students interested in postgraduate research can study towards an MSc by thesis or a PhD under the supervision of staff members in the School of Mathematics and Statistics. Joint supervision with staff from the School of Engineering and Computer Science is also possible.

Further information about PhD study, including scholarship funding, is available on the website of the Faculty of Graduate Research.

In addition, some staff members may have grant funding for PhD research on specific projects.

Prospective research students are encouraged to contact potential supervisors directly.