Poster accepted in Intelligent Data Analysis 2018 – Leila Yousefi

Title: Opening the Black Box: Discovering and Explaining Hidden Variables in Patient Modelling

Authors: Leila Yousefi1, Stephen Swift1, Mahir Arzoky1, Allan Tucker1, Lucia Sacchi2 and Luca Chiovato2

Affiliations: 1. Brunel University London (UK), 2. University of Pavia, Istituti Maugeri (Italy)

Conference website

Spotlight Presentation Slides: Here

Poster accepted in Intelligent Data Analysis 2018 – Nicky Nicolson

Title: Interactive visualisation of field collected botanical specimen metadata: supporting data mining process development

Authors: Nicky Nicolson1,2, Allan Tucker2

Affiliations: 1. Biodiversity Informatics & Spatial Analysis, RBG Kew (UK), 2. Department of Computer Science, Brunel University London (UK)

Abstract: This slide deck outlines the development and utilisation of an interactive data visualisation tool, developed throughout a PhD level research project. Originally designed to aid initial data exploration and gather expert input, the toolkit was further refined to support process design, quality assurance and refinement by viewing data mining results at known stages of a pipeline process, and to enable visualisation of data aggregations used to define new features for use in predictive models. Newly defined features can be regarded as additional data, feeding back into data exploration and forming an iterative process. The toolkit has contributed to reproducible research by adding tool support and activity logging at one of the loosest stages of the research process.

Conference website:


Paper accepted in IEEE e-science 2018 – Nicky Nicolson

Title: Specimens as research objects: reconciliation across distributed repositories to enable metadata propagation

Authors: Nicky Nicolson1,3, Alan Paton2, Sarah Phillips2, Allan Tucker3

Affiliations: 1. Biodiversity Informatics & Spatial Analysis, RBG Kew (UK), 2. Collections, RBG Kew (UK), 3. Department of Computer Science, Brunel University London (UK)

Abstract: Botanical specimens are shared as long-term consultable research objects in a global network of specimen repositories. Multiple specimens are generated from a shared field collection event; generated specimens are then managed individually in separate repositories and independently augmented with research and management metadata which could be propagated to their duplicate peers. Establishing a data-derived network for metadata propagation will enable the reconciliation of closely related specimens which are currently dispersed, unconnected and managed independently. Following a data mining exercise applied to an aggregated dataset of 19,827,998 specimen records from 292 separate specimen repositories, 36% or 7,102,710 specimens are assessed to participate in duplication relationships, allowing the propagation of metadata among the participants in these relationships, totalling: 93,044 type citations, 1,121,865 georeferences, 1,097,168 images and 2,191,179 scientific name determinations. The results enable the creation of networks to identify which repositories could work in collaboration. Some classes of annotation (particularly those regarding scientific name determinations) represent units of scientific work: appropriate management of this data would allow the accumulation of scholarly credit to individual researchers: potential further work in this area is discussed.

Conference website:


Paper accepted in IntelliSys 2018 – Samy Ayed

Title: An Exploratory Study of the Inputs for Ensemble Clustering Technique as a Subset Selection Problem

Authors: Samy Ayed, Mahir Arzoky, Stephen Swift, Steve Counsell and Allan Tucker

Abstract: Ensemble and Consensus Clustering address the problem of unifying multiple clustering results into a single output to best reflect the agreement of input methods. They can be used to obtain more stable and robust clustering results in comparison with a single clustering approach. In this study, we propose a novel subset selection method that looks at controlling the number of clustering inputs and datasets in an efficient way. We propose a number of manual selection and heuristic search techniques to perform the selection. Our investigation and experiments demonstrate very promising results. Using these techniques can ensure better selection methods and datasets for Ensemble and Consensus Clustering and thus more efficient clustering results.

Conference: Intelligent Systems Conference (IntelliSys) 2018, London.

The paper will be published in Springer LNCS Proceedings.

Summer Short Course – Data Analysis and R (11th-12th Jul)

Making Sense out of Software Engineering Data and an Introduction to R

Prof Sandro Morasca, Università degli studi dell’Insubria, Italy

The FREE summer short course (funded by Erasmus+) was organised by Prof Martin Shepperd on 11-12 July, 2018 (13:00-17:00 in WLFB208).

The course addressed techniques that can sensibly be used to extract knowledge from Software Engineering data, whether acquired via experiments or through routine data collection in industrial contexts, and to make that knowledge practically useful. The course described and critically discussed a number of data analysis techniques, explaining their preconditions and their outcomes. It illustrated both basic, traditional techniques and innovative ones, such as those based on Robust Regression or machine learning. It also explained how the resulting models can be validated.

A big thank you to Sandro and Martin for running this fantastic short course.

Lecture slides can be found here.

IDA Meeting (4th Jul 2018)

IDA meeting held at WLFB 207/208 (2nd floor of Wilfred Brown) at 3:00PM

Talk from Natalia Viani, King’s College London

Electronic health records represent a great source of valuable information for both patient care and biomedical research. Despite the efforts put into collecting structured data, a lot of information is available only in the form of free-text. For this reason, developing natural language processing (NLP) systems that identify clinically relevant concepts (e.g., symptoms, medication) is essential. Moreover, contextualizing these concepts from the temporal point of view represents an important step.
Over the past years, many NLP systems have been developed to process clinical texts written in English and belonging to specific medical domains (e.g., intensive care unit, oncology). However, research for multiple languages and domains is still limited. Through my PhD years, I applied information extraction techniques to the analysis of medical reports written in Italian, with a focus on the cardiology domain. In particular, I explored different methods for extracting clinical events and their attributes, as well as temporal expressions. At the moment, I am working on the analysis of mental health records for patients with a diagnosis of schizophrenia, with the aim to automatically identify symptom onset information starting from clinical notes.

Dr Viani is a postdoctoral research associate at the Department of Psychological Medicine, NIHR Biomedical Research Centre, Institute of Psychiatry, Psychology and Neuroscience, King’s College London. She received her PhD in Bioengineering and Bioinformatics from the Department of Electrical, Computer and Biomedical Engineering, University of Pavia, in January 2018. During her PhD, she spent six months as a visiting research scholar in the Natural Language Processing Laboratory at the Computational Health Informatics Program at Boston Children’s Hospital – Harvard Medical School. Her research interests are natural language processing, clinical and temporal information extraction, and biomedical informatics. She is especially interested in the reconstruction of clinical timelines starting from free-text.

Slides from the talk can be found here.

Machine Learning Reading Group (4th Jul 2018)

The Machine Learning Reading Group was held on 04/07/2018 at 1:30 PM (IDA/BSEL Lab). The core concept for the meeting was random forests, and the article discussed was “Prediction of the FIFA World Cup 2018 – A random forest approach with an emphasis on estimated team ability parameters”.

A short presentation on random forests can be found here.

Machine Learning Reading Group (12th Jun 2018)

The Machine Learning Reading Group was held on 12/06/2018 at 11:00 AM (IDA/BSEL Lab). It was led by Dr Alina Miron.

The core concept for the meeting was reinforcement learning, and the article discussed was “Mastering the game of Go without human knowledge”.

A short presentation on reinforcement learning can be found here.

IDA Meeting (8th Feb 2018)

IDA meetings will now be held at our new IDA-BSEL Research Group Laboratory – WBB 208 (2nd floor of Wilfred Brown)

Today’s talks are from:

Samy Ayed on an exploratory study of the inputs for ensemble clustering technique as a subset selection problem (PDF Slides can be found HERE).

Leila Yousefi and Weibo Liu both discussed Deep Learning and how latent variables are used in their PhDs (PDF Slides can be found HERE).

BIOIMAGING 2018 Best Student Paper Award – Bashir Dodo

We are pleased to announce that Bashir Dodo’s paper “Graph-Cut Segmentation of Retinal Layers from OCT Images” has won the BIOIMAGING 2018 Best Student Paper Award.

Below is the abstract and full list of authors.

The segmentation of various retinal layers is vital for diagnosing and tracking progress of medication of various ocular diseases. Due to the complexity of retinal structures, the tediousness of manual segmentation and variation from different specialists, many methods have been proposed to aid with this analysis. However, image artifacts, in addition to inhomogeneity in pathological structures, remain a challenge, with negative influence on the performance of segmentation algorithms. Previous attempts normally pre-process the images or model the segmentation to handle the obstruction, but it still remains an area of active research, especially in relation to the graph based algorithms. In this paper we present an automatic retinal layer segmentation method, which is comprised of fuzzy histogram hyperbolization and graph cut methods to segment 8 boundaries and 7 layers of the retina on 150 OCT B-Scan images, 50 each from the temporal, nasal and centre of foveal region. Our method shows positive results, with additional tolerance and adaptability to contour variance and pathological inconsistency of the retinal structures in all regions.

Bashir Isa Dodo, Yongmin Li, Khalid Eltayef and Xiaohui Liu.

Congratulations again!