Paper accepted in IEEE BIBM 2018 – Leila Yousefi

Title: Opening the Black Box: Discovering and Explaining Hidden Variables in Type 2 Diabetic Patient Modelling

Authors: Leila Yousefi, Stephen Swift, Mahir Arzoky, Lucia Saachi, Luca Chiovato and Allan Tucker

Abstract: Clinicians predict disease and related complications based on prior knowledge and each individual patient’s clinical history. The prediction process is complex because of the existence of unmeasured risk factors, the unexpected development of complications, and varying responses of patients to the disease over time. Exploiting hidden variables (i.e., unmeasured risk factors) can improve the modeling of disease progression and being able to understand the semantics of the hidden variables will enable clinicians to focus on the early diagnosis and treatment of unexpected conditions among sufferers. However, the overuse of hidden variables can lead to complex models that can overfit and are not well understood (being ‘black box’ in nature). Identifying and understanding groups of patients with similar disease profiles (based on discovered hidden variables) makes it possible to better understand the manner of disease progression in different patients while improving prediction. Here, we explore the use of a stepwise method for incrementally identifying hidden variables based on the Induction Causation (IC*) algorithm. We exploit Dynamic Time Warping (DTW) and hierarchical clustering to cluster patients based upon these hidden variables to begin to uncover their meaning with respect to the complications of Type 2 Diabetes Mellitus (T2DM) patients. Our results reveal that inferring a small number of targeted hidden variables and using them to cluster patients not only leads to an improvement in the prediction accuracy but also assists the explanation of different discovered sub-groups.
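
For readers unfamiliar with the clustering step mentioned in the abstract, the following is a minimal sketch of how DTW distances can feed a hierarchical (agglomerative) clustering. It is not the authors' implementation: the toy trajectories, the function names, the absolute-difference cost and the choice of average linkage are all assumptions made for illustration.

```python
from itertools import combinations


def dtw_distance(a, b):
    """Classic dynamic-programming DTW between two numeric sequences."""
    inf = float("inf")
    n, m = len(a), len(b)
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            step = abs(a[i - 1] - b[j - 1])  # assumed local cost
            cost[i][j] = step + min(
                cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1]
            )
    return cost[n][m]


def average_linkage_clusters(series, n_clusters):
    """Agglomerative clustering (average linkage) over pairwise DTW distances."""
    dist = {
        (i, j): dtw_distance(series[i], series[j])
        for i, j in combinations(range(len(series)), 2)
    }
    clusters = [[i] for i in range(len(series))]

    def linkage(c1, c2):
        total = sum(dist[tuple(sorted((i, j)))] for i in c1 for j in c2)
        return total / (len(c1) * len(c2))

    # Repeatedly merge the two closest clusters until n_clusters remain.
    while len(clusters) > n_clusters:
        x, y = min(
            combinations(range(len(clusters)), 2),
            key=lambda p: linkage(clusters[p[0]], clusters[p[1]]),
        )
        merged = clusters[x] + clusters[y]
        clusters = [c for k, c in enumerate(clusters) if k not in (x, y)]
        clusters.append(merged)
    return sorted(sorted(c) for c in clusters)


# Hypothetical hidden-variable trajectories for four patients (toy values,
# deliberately of different lengths, which DTW can handle).
trajectories = [
    [0, 0, 1, 1, 1],
    [0, 1, 1, 1],
    [1, 1, 0, 0, 0],
    [1, 0, 0, 0],
]
groups = average_linkage_clusters(trajectories, n_clusters=2)
print(groups)
```

In this toy example, patients whose hidden variable switches on over time end up in one group and those whose hidden variable switches off in the other, even though their trajectories have different lengths.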

Conference: IEEE International Conference on Bioinformatics and Biomedicine (IEEE BIBM 2018)

Poster accepted in Intelligent Data Analysis 2018 – Nicky Nicolson

Title: Interactive visualisation of field collected botanical specimen metadata: supporting data mining process development

Authors: Nicky Nicolson (1, 2), Allan Tucker (2)

Affiliations: 1. Biodiversity Informatics & Spatial Analysis, RBG Kew (UK), 2. Department of Computer Science, Brunel University London (UK)

Abstract: This slide deck outlines the development and utilisation of an interactive data visualisation tool, developed throughout a PhD level research project. Originally designed to aid initial data exploration and gather expert input, the toolkit was further refined to support process design, quality assurance and refinement by viewing data mining results at known stages of a pipeline process, and to enable visualisation of data aggregations used to define new features for use in predictive models. Newly defined features can be regarded as additional data, feeding back into data exploration and forming an iterative process. The toolkit has contributed to reproducible research by adding tool support and activity logging at one of the loosest stages of the research process.

Conference website:


Paper accepted in IEEE e-science 2018 – Nicky Nicolson

Title: Specimens as research objects: reconciliation across distributed repositories to enable metadata propagation

Authors: Nicky Nicolson (1, 3), Alan Paton (2), Sarah Phillips (2), Allan Tucker (3)

Affiliations: 1. Biodiversity Informatics & Spatial Analysis, RBG Kew (UK), 2. Collections, RBG Kew (UK), 3. Department of Computer Science, Brunel University London (UK)

Abstract: Botanical specimens are shared as long-term consultable research objects in a global network of specimen repositories. Multiple specimens are generated from a shared field collection event; generated specimens are then managed individually in separate repositories and independently augmented with research and management metadata which could be propagated to their duplicate peers. Establishing a data-derived network for metadata propagation will enable the reconciliation of closely related specimens which are currently dispersed, unconnected and managed independently. Following a data mining exercise applied to an aggregated dataset of 19,827,998 specimen records from 292 separate specimen repositories, 36% or 7,102,710 specimens are assessed to participate in duplication relationships, allowing the propagation of metadata among the participants in these relationships, totalling: 93,044 type citations, 1,121,865 georeferences, 1,097,168 images and 2,191,179 scientific name determinations. The results enable the creation of networks to identify which repositories could work in collaboration. Some classes of annotation (particularly those regarding scientific name determinations) represent units of scientific work: appropriate management of this data would allow the accumulation of scholarly credit to individual researchers: potential further work in this area is discussed.

Conference website:


Paper accepted in IntelliSys 2018 – Samy Ayed

Title: An Exploratory Study of the Inputs for Ensemble Clustering Technique as a Subset Selection Problem

Authors: Samy Ayed, Mahir Arzoky, Stephen Swift, Steve Counsell and Allan Tucker

Abstract: Ensemble and Consensus Clustering address the problem of unifying multiple clustering results into a single output to best reflect the agreement of input methods. They can be used to obtain more stable and robust clustering results in comparison with a single clustering approach. In this study, we propose a novel subset selection method that looks at controlling the number of clustering inputs and datasets in an efficient way. The authors propose a number of manual selection and heuristic search techniques to perform the selection. Our investigation and experiments demonstrate very promising results. Using these techniques can ensure better selection methods and datasets for Ensemble and Consensus Clustering and thus more efficient clustering results.
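
To illustrate the general idea of unifying several clustering results into one, here is a minimal sketch based on a co-association (agreement) matrix. It does not reproduce the subset selection method proposed in the paper: the function names, the 0.5 threshold and the toy label vectors are assumptions chosen for illustration only.

```python
def co_association(labelings):
    """Fraction of input clusterings that place each pair of items together."""
    n = len(labelings[0])
    counts = [[0] * n for _ in range(n)]
    for labels in labelings:
        for i in range(n):
            for j in range(n):
                if labels[i] == labels[j]:
                    counts[i][j] += 1
    return [[c / len(labelings) for c in row] for row in counts]


def consensus_groups(labelings, threshold=0.5):
    """Group items that co-occur in more than `threshold` of the inputs."""
    agreement = co_association(labelings)
    n = len(agreement)
    group_of = list(range(n))  # each item starts in its own group
    for i in range(n):
        for j in range(i + 1, n):
            if agreement[i][j] > threshold:
                # Merge the group containing j into the group containing i.
                old, new = group_of[j], group_of[i]
                group_of = [new if g == old else g for g in group_of]
    groups = {}
    for item, g in enumerate(group_of):
        groups.setdefault(g, []).append(item)
    return sorted(groups.values())


# Three hypothetical input clusterings of five items. Cluster ids are
# arbitrary per input, so only co-membership is compared, never the ids.
inputs = [
    [0, 0, 0, 1, 1],
    [1, 1, 0, 0, 0],
    [0, 0, 1, 1, 1],
]
consensus = consensus_groups(inputs)
print(consensus)
```

Because the agreement matrix is built from whichever input clusterings are supplied, choosing a different subset of inputs changes the matrix and can therefore change the consensus, which is why the selection of inputs matters.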

Conference: Intelligent Systems Conference (IntelliSys) 2018, London.

The paper will be published in the Springer LNCS Proceedings.

New PhD student

Welcome to Ben Evans, who has been awarded a prestigious London NERC DTP scholarship for the project “A global canonical image data set for automatic species classification”, working with the Zoological Society of London and Google.


New Seminar Series in IDA

We have a new funded seminar series in the IDA group starting in October 2018 on the theme of “Opening the Black Box”.

Please look out for details on the website here in the coming months.

Summer Short Course – Data Analysis and R (11th-12th Jul)

Making Sense out of Software Engineering Data and an Introduction to R

Prof Sandro Morasca, Università degli studi dell’Insubria, Italy

The FREE summer short course (funded by Erasmus+) was organised by Prof Martin Shepperd on 11-12 July, 2018 (13:00-17:00 in WLFB208).

The course addressed techniques that can sensibly be used to extract practically useful knowledge from Software Engineering data acquired via experiments or routine data collection in industrial contexts. The course described and critically discussed a number of data analysis techniques, explaining their preconditions and their outcomes. It illustrated both basic, traditional techniques and innovative ones, such as those based on Robust Regression or machine learning. It also explained how the resulting models can be validated.

A big thank you to Sandro and Martin for running this fantastic short course.

Lecture slides can be found here.