Opening the Black Box (2018-2019) 

Both the commercial and academic sector are exploring the use of their state-of-the-art algorithms to make important decisions, for example in healthcare. These algorithms exploit a heterogeneous mix of on-body sensor data, clinical test results, socio-economic information, and digitised electronic health records. A major issue is that many of the algorithms on offer are often black box in nature (defined as a system which can be viewed in terms of its inputs and outputs without any knowledge of its internal workings). This is because the algorithms are often extremely complex with many parameters (such as deep learning) and also because the algorithms themselves are now valuable commodities. Not knowing the underlying mechanisms of these black box systems is a problem for two important reasons. Firstly, if the predictive models are not transparent and explainable, we lose the trust of experts such as healthcare practitioners. Secondly, without access to the knowledge of how an algorithm works we cannot truly understand the underlying meaning of the output. These problems need to be addressed if we are to make new insights into data such as disease and health.

This seminar series will include talks from experts from statistics, computer science, psychology, and health / medicine in order to address this issue head-on. The seminar will focus on building a network of experts in state-of-the-art technologies that exploit the huge data resources available, while ensuring that these systems are explainable to domain experts. This will result in systems that not only generate new insights but are also more fully trusted.


  • TBC: Pedro Rodrigues, University of Porto
  • TBC: Sandra Wachter, The Oxford Internet Institute

(All seminars held at 3pm in WLFB 207/208)

Previous Seminars

  • July 2018: “Temporal Information Extraction from Clinical Narratives”
    Natalia Viani, King’s College London Electronic health records represent a great source of valuable information for both patient care and biomedical research. Despite the efforts put into collecting structured data, a lot of information is available only in the form of free-text. For this reason, developing natural language processing (NLP) systems that identify clinically relevant concepts (e.g., symptoms, medication) is essential. Moreover, contextualizing these concepts from the temporal point of view represents an important step.
    Over the past years, many NLP systems have been developed to process clinical texts written in English and belonging to specific medical domains (e.g., intensive care unit, oncology). However, research for multiple languages and domains is still limited. Through my PhD years, I applied information extraction techniques to the analysis of medical reports written in Italian, with a focus on the cardiology domain. In particular, I explored different methods for extracting clinical events and their attributes, as well as temporal expressions.
    At the moment, I am working on the analysis of mental health records for patients with a diagnosis of schizophrenia, with the aim to automatically identify symptom onset information starting from clinical notes.Dr Viani is a postdoctoral research associate at the Department of Psychological Medicine, NIHR Biomedical Research Centre, Institute of Psychiatry, Psychology and Neuroscience, King’s College London. She received her PhD in Bioengineering and Bioinformatics from the Department of Electrical, Computer and Biomedical Engineering, University of Pavia, in January 2018. During her PhD, she spent six months as a visiting research scholar in the Natural Language Processing Laboratory at the Computational Health Informatics Program at Boston Children’s Hospital – Harvard Medical School.
    Her research interests are natural language processing, clinical and temporal information extraction, and biomedical informatics. I am especially interested in the reconstruction of clinical timelines starting from free-text.
  • March 2018: “Optimal Low-dimensional Projections for Spectral Clustering”
    Nicos Pavlidis, Lancaster University The presentation will discuss the problem of determining the optimal low dimensional projection for maximising the separability of a binary partition of an unlabelled dataset, as measured by spectral graph theory. This is achieved by finding projections which minimise the second eigenvalue of the graph Laplacian of the projected data, which corresponds to a non-convex, non-smooth optimisation problem. It can be shown that the optimal univariate projection based on spectral connectivity converges to the vector normal to the maximum margin hyperplane through the data, as the scaling parameter is reduced to zero. This establishes a connection between connectivity as measured by spectral graph theory and maximal Euclidean separation.
  • December 2016: “The value of evaluation: towards trustworthy machine learning”
    Peter Flach, University of Bristol Machine learning, broadly defined as data-driven technology to enhance human decision making, is already in widespread use and will soon be ubiquitous and indispensable in all areas of human endeavour. Data is collected routinely in all areas of significant societal relevance including law, policy, national security, education and healthcare, and machine learning informs decision making by detecting patterns in the data. Achieving transparency, robustness and trustworthiness of these machine learning applications is hence of paramount importance, and evaluation procedures and metrics play a key role in this.In this talk I will review current issues in theory and practice of evaluating predictive machine learning models. Many issues arise from a limited appreciation of the importance of the scale on which metrics are expressed. I will discuss why it is OK to use the arithmetic average for aggregating accuracies achieved over different test sets but not for aggregating F-scores. I will also discuss why it is OK to use logistic scaling to calibrate the scores of a support vector machine but not to calibrate naive Bayes. More generally, I will discuss the need for a dedicated measurement theory for machine learning that would use latent-variable models such as item-response theory
    from psychometrics in order to estimate latent skills and capabilities from observable traits.
  • October 2016: “On Models, Patterns, and Prediction”
    Jaakko Hollmén, Aalto University, Helsinki Pattern discovery has been the center of attention of data mining research for a long time, with patterns languages varying from simple to complex, according to the needs of the applications and the format of data. In this talk, I will take a view on pattern mining that combines elements from neighboring areas. More specifically, I will describe our previous research work in the intersection of the three areas: probabilistic modeling, pattern mining and predictive modeling. Clustering in the context of pattern mining will be explored, as well as linguistic summarization patterns. Also, multiresolution pattern mining as well as semantic pattern discovery and pattern visualization will be visited. Time allowing, I will speak about patterns of missing data and its implications on predictive modeling.Jaakko Hollmén is faculty member at the Department of Computer Science at Aalto University in Espoo, Finland. He received his doctoral degree with distinction in 2000. His research interests include data analysis, machine learning and data mining, with applications in health and in environmental informatics. He has chaired several conferences in his areas of interest, including IDA, DS, IEEE Computer-Based Medical Systems. Currently, he is co-chair of the Program Committee of ECML PKDD 2017, which is organized in Skopje, Macdonia during September 19-23, 2017. His publications can be found at:
  • May 2016: “Beyond Clinical Data Mining: Electronic Phenotyping for Research Cohort Identification”
    John Holmes, University of Pennsylvania The availability of ever-increasing amounts of highly heterogeneous clinical data poses both opportunities and challenges for the data scientist and clinical researcher. Electronic medical records are more prevalent than ever, and now we see that other data sources contribute greatly to the clinical research enterprise. These sources provide genetic, image, and environmental data, just to name three. Now, it is possible to investigate the effects of built environment, such as the availability of food markets, sidewalks, and playgrounds, coupled with clinical observations noted in in the process of providing patient care, along with identified genetic variants that could predispose one to diabetes mellitus. Furthermore, these data could be used in a truly integrated sense to manage such patients more effectively than relying solely on the traditional medical record. The opportunity for enhanced clinical research is manifest in this expanding data and information ecosystem. The challenges are more subtly detected, but present nonetheless. Merging these heterogeneous data into an analyzable whole depends on the availability of a robust unique identifier that has yet to be created, at least in the US. As a result, researchers have developed various probabilistic methods of record matching, occasionally at the expense of data privacy and confidentiality. Another challenge is the sheer heterogeneity of the data; it is not easy to understand the clinical context of an image or waveform without their semantic integration with clinical observation data. In addition, there is the problem of ecologic fallacy, which arises from using data that have no real connection to a clinical record in the service of proposing or testing hypotheses. This problem is quite evident when coupling environmental and clinical data: just because there is a well-stocked market with a surfeit of inexpensive, healthy food options in a person’s neighborhood doesn’t mean that that person avails herself of these items. Finally, there is the problem of data quality. Much of the data we use- whether collected by us or obtained from another source- is replete with problems, such as missingness, contradictions, and errors in representation. We will explore in detail the opportunities and challenges posed to informatics and clinical researchers as they are faced with these seemingly endless sources of data. We will also discuss novel approaches to mining these complex, heterogeneous data for the purpose of constructing cohorts for research.John Holmes is Professor of Medical Informatics in Epidemiology at the University of Pennsylvania Perelman School of Medicine. He is the Associate Director of the Penn Institute for Biomedical Informatics and is Chair of the Graduate Group in Epidemiology and Biostatistics. Dr. Holmes’ research interests are focused on several areas in medical informatics, including evolutionary computation and machine learning approaches to knowledge discovery in clinical databases (data mining), interoperable information systems infrastructures for epidemiologic surveillance, regulatory science as it applies to health information and information systems, clinical decision support systems, semantic analysis, shared decision making and patient-physician communication, and information systems user behavior. Dr. Holmes is a principal or co-investigator on projects funded by the National Cancer Institute, the Patient-Centered Outcomes Research Institute, the National Library of Medicine, and the Agency for Healthcare Research and Quality, and he is the principal investigator of the NIH-funded Penn Center of Excellence in Prostate Cancer Disparities. Dr. Holmes is engaged with the Botswana-UPenn Partnership, assisting in building informatics education and clinical research capacity in Botswana. He leads the evaluation of the National Obesity Observational Studies of the Patient-Centered Clinical Research Network. Dr. Holmes is an elected Fellow of the American College of Medical Informatics and the American College of Epidemiology