Part of the “Exploiting Simulation, AI & Knowledge to Improve Healthcare” Series
First Talk: Synthetic data and membership inference attacks
Abstract: Synthetic data is seen as a very promising solution for sharing individual-level data while limiting privacy risks. The promise of synthetic data lies in the fact that it is generated by sampling new values from a statistical model. Since the generated records are artificial, direct reidentification of individuals by singling out their record in the dataset is not possible. Synthetic data, if truly privacy-preserving, can be shared and used freely, as it would no longer fall under the scope of data protection regulations such as the EU’s GDPR. Researchers have, however, shown that synthetic data is not automatically privacy-preserving. This is because the statistical models used to generate synthetic data, so-called generative models, are fitted on real data in order to approximate the true data distribution, and models can leak information about their training dataset.
In this talk, I will focus on membership inference attacks (MIA) which are the standard tool to evaluate privacy leakage in data releases, including machine learning models trained on sensitive datasets and, more recently, synthetic datasets. MIAs aim to infer whether a particular sample was part of the private data used to train the generative model. I will describe the challenges of MIAs and dive deeper into two of my recent works on the topic. First, I will describe a method to identify vulnerable records of the private dataset on which the generative model is trained, using MIA risk as a measure of vulnerability. Second, I will describe a new MIA which removes an assumption commonly made in previous works about the adversary’s background knowledge. More specifically, this MIA can be performed using only synthetic data to learn a distinguishing boundary between releases trained with or without a particular record of interest.
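The distinguishing-boundary idea can be illustrated with a toy shadow-modelling sketch (all names, the Gaussian “generative model”, and the nearest-centroid distinguisher below are hypothetical simplifications, not the method presented in the talk): the adversary generates many synthetic releases from training sets with and without the target record, and learns to separate the two cases from synthetic-data statistics alone.

```python
import numpy as np

rng = np.random.default_rng(0)

def fit_and_sample(train, n_synth, rng):
    # Toy "generative model": fit a Gaussian to the training records
    # and sample a synthetic dataset from it.
    return rng.normal(train.mean(), train.std(), n_synth)

def features(synth):
    # The adversary only sees synthetic data; summary statistics of the
    # release serve as attack features.
    return np.array([synth.mean(), synth.std()])

target = 5.0                              # record of interest (an outlier)
population = rng.normal(0.0, 1.0, 1000)   # adversary's reference population

# Shadow experiment: generate releases from training sets WITH and
# WITHOUT the target, and record the resulting feature vectors.
feats, labels = [], []
for _ in range(200):
    base = rng.choice(population, 99, replace=False)
    for member in (0, 1):
        extra = target if member else rng.choice(population)
        synth = fit_and_sample(np.append(base, extra), 500, rng)
        feats.append(features(synth))
        labels.append(member)
feats, labels = np.array(feats), np.array(labels)

# Distinguishing boundary (nearest centroid): is a release closer to the
# "member" shadows or the "non-member" shadows in feature space?
c_in = feats[labels == 1].mean(axis=0)
c_out = feats[labels == 0].mean(axis=0)

def infer_membership(synth_release):
    f = features(synth_release)
    return np.linalg.norm(f - c_in) < np.linalg.norm(f - c_out)

# A fresh release whose training set did contain the target record.
release = fit_and_sample(np.append(rng.choice(population, 99, replace=False), target), 500, rng)
print("target inferred as member:", infer_membership(release))
```

The outlier inflates the fitted standard deviation, so releases trained with it sit measurably apart in feature space; real attacks replace the Gaussian with a full generative model and the centroid rule with a learned classifier.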
Bio: Ana-Maria Cretu is a postdoctoral researcher in the SPRING Lab at EPFL in Switzerland, where she works on privacy and security. She is a recipient of the CYD Distinguished Postdoctoral Fellowship of the Swiss Cyber-Defense Campus. She completed her PhD in 2023 at Imperial College London, where she was supervised by Dr. Yves-Alexandre de Montjoye. In her thesis, she studied privacy and security vulnerabilities in modern data processing systems, including machine learning models, query-based systems, and synthetic data, developing new methods for the automated auditing of such systems. Through a rigorous study of privacy vulnerabilities, her research aims to inform the design of principled countermeasures that prevent them and, ultimately, allow data to be used safely. Ana-Maria holds an MSc in Computer Science from EPFL, Switzerland, and a BSc and MSc from Ecole Polytechnique, France. She was a visiting researcher at the University of Oxford, where she worked on deep learning techniques for natural language processing. She did two internships at Google (2016 and 2017), one at Twitter (2020) and one at Microsoft (2022).
Second Talk: Dante’s Uncertainty: A story of AIngels and dAImons in COVID modelling
Abstract: My goal during the pandemic was to review epidemiological COVID models, but as the need arose I ended up both reviewing and creating them. Navigating the Nine* Layers of COVID uncertainty, I discovered where my lack of AI expertise led to serious development issues, where the AIngels shone brightest, and where the “dAImons” efforts ruined the party for more serious modellers. When it comes to COVID forecasting, all of the models are wrong and often less than useful, but nevertheless as a community we ended up with a much better understanding of pandemics and pandemic preparation than before. In my talk I will try to explain why we actually did end up being more able to do forecasts, whilst also showing a fairly sobering perspective on the enormous uncertainty that surrounds epidemic modelling forecasts (and some AI and other tools to address that uncertainty).
*open to interpretation: two layers or forty layers could also be justifiable
Bio: Derek Groen is a Reader in Computer Science at Brunel University London, and a Visiting Lecturer at University College London. He has a PhD from the University of Amsterdam (2010) in Computational Astrophysics, and was a Post-Doctoral Researcher at UCL for five years prior to joining Brunel as Lecturer. Derek has a strong interest in high performance simulations, multiscale modelling and simulation, and so-called VVUQ (verification, validation and uncertainty quantification). In terms of applications he is a lead developer on the Flee migration modelling code and the Flu And Coronavirus Simulator (FACS) COVID-19 model. He has also previously worked on applications in astrophysics, materials and blood flow. Derek has been PI for Brunel in two large and recent Horizon 2020 research projects (VECMA on uncertainty quantification, and HiDALGO on global challenge simulations) and he is currently the technical manager of the UK-funded SEAVEA project, which develops a VVUQ toolkit for large-scale computing applications (seavea-project.org). His most recent publication (at time of writing) is a software paper about the FabSim3 research automation toolkit, which was selected as a Feature Paper for Computer Physics Communications.
- Do you need permission to train AI on copyright protected data?
- Should AI-generated content be protected by copyright?
- Can AI-generated content infringe copyright holders’ rights?
- To explore how simulations and synthetic data can augment machine learning to improve performance, explainability and reduce bias in healthcare models
- To investigate the privacy of computational methods including how synthetic data, simulations and federated learning can protect against privacy attacks
- To survey how important users’ trust in AI systems is for critical decision-making
10th May 2023 @ 3pm in WBB207/208 – 3 Talks:
- The Impact of Bias on Drift Detection in AI Health Software, Asal Khoshravan Azar
Despite the potential of AI in healthcare decision-making, there are also risks to the public for different reasons. Bias is one risk: any data unfairness present in the training set, such as the under-representation of certain minority groups, will be reflected by the model resulting in inaccurate predictions. Data drift is another concern: models trained on obsolete data will perform poorly on newly available data. Approaches to analysing bias and data drift independently are already available in the literature, allowing researchers to develop inclusive models or models that are up-to-date. However, the two issues can interact with each other. For instance, drifts within under-represented subgroups might be masked when assessing a model on the whole population. To ensure the deployment of a trustworthy model, we propose that it is crucial to evaluate its performance both on the overall population and across under-represented cohorts. In this talk, we explore a methodology to investigate the presence of drift that may only be evident in sub-populations in two protected attributes, i.e., ethnicity and gender. We use the BayesBoost technique to capture under-represented individuals and to boost these cases by inferring cases from a Bayesian network. Lastly, we evaluate the capability of this technique to handle some cases of drift detection across different sub-populations.
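The masking effect described above can be illustrated with a toy example (the numbers and the simple mean-shift test are hypothetical, not the BayesBoost methodology): a drift confined to a small subgroup barely moves population-level statistics, yet stands out when the same test is run inside the subgroup.

```python
import numpy as np

rng = np.random.default_rng(42)

def mean_shift_z(ref, cur):
    # Two-sample z statistic for a shift in the mean of a monitored signal
    # (e.g. a model input or the model's error) between two time windows.
    se = np.sqrt(ref.var() / len(ref) + cur.var() / len(cur))
    return abs(ref.mean() - cur.mean()) / se

def window(n, drifted):
    # 2% of records belong to an under-represented subgroup; in the
    # "current" window only that subgroup's distribution shifts.
    group = rng.random(n) < 0.02
    x = rng.normal(0.0, 1.0, n)
    if drifted:
        x[group] += 1.0
    return group, x

g_ref, x_ref = window(2000, drifted=False)   # reference window
g_cur, x_cur = window(2000, drifted=True)    # current window

z_overall = mean_shift_z(x_ref, x_cur)
z_minority = mean_shift_z(x_ref[g_ref], x_cur[g_cur])

# The drift is masked at population level but visible inside the subgroup.
print(f"overall z = {z_overall:.2f}, subgroup z = {z_minority:.2f}")
```

A population-level check averages the shifted 2% against the unchanged 98%, which is why per-cohort evaluation on protected attributes is needed.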
- Privacy Assessment of Synthetic Patient Data, Ferdoos Hossein Nezhad
In this talk, we quantify the privacy gain of synthetic patient data drawn from two generative models, MST and PrivBayes, both trained on real anonymized primary care patient data. The evaluation covers two types of inference attacks, namely membership and attribute inference attacks, implemented using a new toolbox, TAPAS. The aim is to quantitatively evaluate the privacy gain under each attack, across the two differentially private generators and different threat models, with a focus on black-box knowledge. The evaluation demonstrates that the vulnerabilities of synthetic patient data depend on the attack scenario, the threat model, and the algorithm used to generate the synthetic patient data. It was shown empirically that although the synthetic patient data achieved a high privacy gain in most attack scenarios, it does not behave uniformly against adversarial attacks, and some records and outliers remain vulnerable depending on the attack scenario. Moreover, the PrivBayes generator proved more robust than MST in terms of the privacy preservation of the synthetic data.
- Creating Synthetic Geospatial Patient Data to Mimic Real Data Whilst Preserving Privacy, Dima Alattal
Synthetic Individual-Level Geospatial Data (SILGSD) offers a number of advantages in Spatial Epidemiology when compared to census data or surveys conducted on regional or global levels. The use of SILGSD will bring a new dimension to the study of the patterns and causes of diseases in a particular
location while minimizing the risk of patient identity disclosure, especially for rare conditions. Additionally, SILGSD will help in building and monitoring more stable machine learning models in
local areas, improving the quality and effectiveness of healthcare services. Finally, SILGSD will be highly effective in controlling the spread and causes of diseases by studying disease movement across areas through the commuting patterns of those affected. To our knowledge, no real or synthetic health records data containing geographic locations for patients has been published for research purposes so far. Therefore, in this talk we provide SILGSD by allocating synthetic patients to general practices (healthcare providers) in the UK using the prevalence of health conditions in each practice. The assigned general practice locations are used as physical geolocations for the patients because in reality the patients are registered in the nearest practice to their homes. To generate high fidelity data we allocate synthetic primary care patients from the Clinical Practice Research Datalink (CPRD) instead of real patients to England’s general practices (GPs), using the publicly available GP health conditions statistics from the Quality and Outcomes Framework (QOF) without using more precise data. Further, the allocation relies on the similarities between patients in different locations without using the real location for the patients. We demonstrate that the Allocation Data is able to accurately mimic the real health conditions distribution in the general practices and also preserves the underlying distribution of the original primary care patients data from CPRD (Gold Standard).
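The prevalence-based allocation can be sketched in miniature (the practice names, prevalences, and list sizes below are made up for illustration; this is not the actual CPRD/QOF pipeline): each synthetic patient with a given condition is assigned to a practice with probability proportional to that practice’s expected case count.

```python
import numpy as np

rng = np.random.default_rng(7)

# Hypothetical per-practice figures (NOT real QOF data): prevalence of one
# condition and registered list size for three made-up practices.
prevalence = {"GP_A": 0.02, "GP_B": 0.10, "GP_C": 0.05}
list_size = {"GP_A": 8000, "GP_B": 6000, "GP_C": 10000}

# Expected case count per practice drives the allocation: a synthetic
# patient with the condition goes to a practice with probability
# proportional to prevalence * list size.
practices = list(prevalence)
weights = np.array([prevalence[p] * list_size[p] for p in practices], dtype=float)
probs = weights / weights.sum()

def allocate(n_patients):
    # Assign each synthetic patient a practice (and hence a geolocation).
    return rng.choice(practices, size=n_patients, p=probs)

assignment = allocate(5000)
counts = {p: int((assignment == p).sum()) for p in practices}
print(counts)   # allocation tracks the expected case counts per practice
```

Because only aggregate prevalence statistics enter the weights, no individual patient’s real location is ever used, which is the privacy point the talk makes.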
Bio: Dr. Pavithra Rajendran currently works in the DRIVE unit at Great Ormond Street Hospital NHS Foundation Trust as the NLP Technical Lead. Previously, she worked at KPMG UK as an NLP Data Scientist, using both traditional and deep-learning-based NLP techniques for various client projects in both the public and private sectors, from proof-of-concept to production (healthcare, oil and gas, travel, finance, etc.). She received her PhD degree in Computer Science from the University of Liverpool, and her research interests include Natural Language Processing, applications of NLP within the healthcare domain, and Argument Mining.
Abstract: Genomic testing has the potential to deliver precision medicine by providing a greater understanding of diagnosis and treatments that can benefit patients. Often, genomic test reports are written by clinical scientists and stored as unstructured data in the form of PDFs, which makes them a challenge for secondary usage (e.g. research) and clinical decision making. In this talk, I will describe the end-to-end pipeline developed for extracting relevant information from genomic reports, with a brief overview of the NLP techniques used.
In this talk, Matloob will summarise his major research themes: i) DNA-binding proteins called transcription factors (TFs) regulate various cell functions and play a key role in the development and progression of genetic diseases such as cancer. His work in bioinformatics and microscopy imaging has identified DNA locations and TFs that are involved in various diseases, including breast cancer. His bioinformatics expertise has won him £355K from NERC, UKRI this year. ii) Social media platforms such as Twitter and Reddit have become valuable sources of information for public health surveillance applications. Matloob will summarise some of his recent work in natural language processing. iii) Financial markets are very dynamic and (over)react to all types of economic news, making it difficult to predict the prices of financial instruments. Matloob will share some of his algorithms for dealing with this issue.
Allan Tucker, Ylenia Rotalinti, Barbara Draghi, Awad Alyousef
Dr Yiming Wang: A Transfer Learning-based Method for Defect Detection in Additive Manufacturing
Yani Xue – Many-objective optimization and its application in forced migration
Abstract: Many-objective optimization is core to both artificial intelligence and data analytics as real-world problems commonly involve multiple objectives which are required to be optimized simultaneously. A large number of evolutionary algorithms have been developed to search for a set of Pareto optimal solutions for many-objective optimization problems. It is very rare that a many-objective evolutionary algorithm performs well in terms of both effectiveness and efficiency, two key evaluation criteria. Some algorithms may struggle to guide the solutions towards the Pareto front, e.g., Pareto-based algorithms, while other algorithms may have difficulty in diversifying the solutions evenly over the front on certain problems, e.g., decomposition-based algorithms. Furthermore, some effective algorithms may become very computationally expensive as the number of objectives increases, e.g., indicator-based algorithms.
In this talk, we will investigate how to make evolutionary algorithms perform well in terms of effectiveness and efficiency in many-objective optimization. We will show 1) how to improve the effectiveness of conventional Pareto-based algorithms, 2) how to further enhance the effectiveness of leading many-objective evolutionary algorithms in general, 3) how to strike a balance between effectiveness and efficiency of evolutionary algorithms when solving many-objective optimization problems, and 4) how to apply evolutionary algorithms to a real-world case.
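At the core of the Pareto-based algorithms discussed above is the dominance relation; a minimal sketch (assuming all objectives are minimized) of extracting the non-dominated set from a population of candidate solutions:

```python
import numpy as np

def dominates(a, b):
    # a Pareto-dominates b (minimization): no worse in every objective
    # and strictly better in at least one.
    a, b = np.asarray(a), np.asarray(b)
    return bool(np.all(a <= b) and np.any(a < b))

def non_dominated(points):
    # Keep exactly the objective vectors that no other vector dominates:
    # the (first) non-dominated front used by Pareto-based algorithms.
    pts = np.asarray(points, dtype=float)
    keep = [i for i, p in enumerate(pts)
            if not any(dominates(q, p) for j, q in enumerate(pts) if j != i)]
    return pts[keep]

# Five candidate solutions with two objectives each (both minimized):
objs = [(1, 5), (2, 2), (3, 1), (4, 4), (2, 3)]
front = non_dominated(objs)
print(front)   # (4, 4) and (2, 3) are dominated and drop out
```

This pairwise check is quadratic in the population size, and with many objectives almost every solution becomes mutually non-dominated, which is exactly why the effectiveness and efficiency trade-offs above arise.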
Biography: Dr Yani Xue is currently a research fellow in multi-objective optimization at Brunel University London, UK. She received the Ph.D. degree in computer science from Brunel University London, UK, in 2021. Her main research interests include evolutionary computation, multi-objective optimization, search-based software engineering, and engineering applications.
Futra Fadzil – Phase-wise, Constrained, Many-Objective Optimization Problem in the Mining Industry
Abstract: The digital mine process exhibits three distinct features, namely: (1) phase-wise evolution, leading to dynamic changes of the objective functions as the phase changes; (2) physical/resource constraints, leading to feasibility challenges in the optimization problems; and (3) many-objective characteristics, including the design process, energy gains, platform, operational profile, mine management and, finally, the life-cycle costs. Traditionally, the employed optimization techniques include constrained programming, ant colony optimization, fuzzy logic, evolutionary algorithms, and combinatorial techniques. However, all existing results have been limited to particular parts of the life-cycle of the digital mine process, and very little effort has been devoted to optimization over the full life-cycle of the mining process, consisting of several phases over time that include energy, waste, emissions, ventilation, routes, and cooling. The proposed concept of a phase-wise, constrained, many-objective optimization (PWCMOO) smart scheduling tool stems from mine production practice, is completely new, and opens a new branch of research for both the computational intelligence and engineering design communities, which demands novel approaches if we are to advance significantly beyond the state of the art. Working across system modelling, optimization and evolutionary computation, we are set to develop: 1) a novel algorithm for the phase-wise optimization problem (POP) that caters for the switching phenomena between phases; 2) a novel computational framework for phase-based evolutionary computation that balances convergence and diversity among different phases.
In this talk, we formulate three optimization problems in mining activities: the open-pit stability problem, the truck movement scheduling problem, and the water discharge monitoring problem. All these phases have their own independent objectives and constraints. Later, we integrate them as an optimal scheduling problem in the intelligence layer of a human-centred Internet of Things platform for the sustainable digital mine of the future.
Biography: Futra Fadzil is a Research Fellow in the Department of Computer Science. He currently works on the Horizon 2020 DIG IT project, which focuses on many-objective optimization and smart scheduling for the sustainable digital mine of the future. He received his PhD degree in Electrical and Electronic Engineering from Brunel University London in 2020. Over the years he has gained experience in the power industry and participated in numerous research projects in the following areas: electrical & instrumentation; operation and maintenance; project management; industrial data acquisition; real-time data analytics; system modelling; system optimization; machine learning; and the Industrial Internet of Things (IIoT).
23rd June 2021 @ 2pm – Dr James Westland Cain, Grass Valley
DeepFakes – What are they and why should I care?
9th June: Panagiotis Papapetrou, Stockholm University, “Interpretable feature learning and classification: from time series counterfactuals to temporal abstractions in medical records”
12th May 2021 – Stefano Vacca, Alkemy S.p.A (Milan, Italy):
Listening to social media through Natural Language Processing and Geo Intelligence
Abstract: Social Data Intelligence is a form of data analysis focused on social media data. On the web, and especially on social networks, vast amounts of different kinds of data are produced. People use Twitter, Instagram, TikTok, Fortnite, and Twitch, and write down their impressions, opinions, and feelings about political events, social phenomena, and famous people. This seminar aims to show, through practical use cases, how innovative data mining techniques can be used to extract hidden information from social media, either to measure a specific phenomenon and help management make decisions around specific trends (and discover new opportunities), or for research purposes.
Biography: Stefano Vacca has been a Data Scientist at Alkemy S.p.A (Milan, Italy) since 2018. He holds a bachelor’s degree in Economics and Finance, with a thesis focused on Smart Cities’ European legislation. In 2020 he obtained a master’s degree in Data Science, Business Analytics and Innovation at the University of Cagliari (Italy), with a thesis entitled “Hawkes processes for the identification of premonitory events in the cryptocurrency market”. In 2019 he worked on a project for the Enlightenment.ai company (Lisbon, Portugal), constructing computer vision algorithms to recognise digits from counter meter images. Stefano regularly gives seminars for both academia and industry and has published several research papers in the field of data mining for cryptocurrencies.
14th Oct at 3pm: Lianghao Han, Biomechanically Informed Image Registration Framework for Organ Motion and Tissue Deformation Estimations in Medical Image Analysis and Image-Guided Intervention
Abstract: Organ motion and tissue deformation are two big challenges in medical image analysis and image-guided intervention. In this talk, I will introduce a biomechanically informed image registration framework for estimating organ motion and tissue deformation from multimodal medical images, in which biomechanical models are incorporated into image registration and simulation algorithms. I will also introduce several applications in breast cancer detection, lung cancer radiotherapy and prostate cancer biopsy.
Bio: Dr Lianghao Han is a Senior Research Fellow at the Department of Computer Science, Brunel University. He received his PhD degree from the University of Cambridge and has worked in the Medical Vision Lab at the University of Oxford and at the Centre for Medical Image Computing at UCL. Before joining Brunel, he was a Professor at Tongji University (P.R. China).
Lianghao’s research interests are in Medical Image Analysis for Cancer Detection and Diagnosis (Lung, Breast, Liver and Prostate), Image-Guided Intervention, Biomechanics and Machine Learning.
30th Sept at 3pm: Lorraine Ayad, MARS: improving multiple circular sequence alignment using refined sequences
3rd June at 3pm: Isabel Sassoon, Applications of Computational Argumentation in Data Driven Decision Support
Opening the Black Box (2018-2020)
Both the commercial and academic sectors are exploring the use of their state-of-the-art algorithms to make important decisions, for example in healthcare. These algorithms exploit a heterogeneous mix of on-body sensor data, clinical test results, socio-economic information, and digitised electronic health records. A major issue is that many of the algorithms on offer are black box in nature (a black box being a system that can be viewed in terms of its inputs and outputs without any knowledge of its internal workings). This is because the algorithms are often extremely complex with many parameters (as in deep learning), and also because the algorithms themselves are now valuable commodities. Not knowing the underlying mechanisms of these black box systems is a problem for two important reasons. Firstly, if the predictive models are not transparent and explainable, we lose the trust of experts such as healthcare practitioners. Secondly, without knowledge of how an algorithm works we cannot truly understand the underlying meaning of its output. These problems need to be addressed if we are to gain new insights from data on disease and health.
This seminar series will include talks from experts from statistics, computer science, psychology, and health / medicine in order to address this issue head-on. The seminar will focus on building a network of experts in state-of-the-art technologies that exploit the huge data resources available, while ensuring that these systems are explainable to domain experts. This will result in systems that not only generate new insights but are also more fully trusted.
19 March 2pm: Symbiosis Centre for Applied Artificial Intelligence
Prof. Ketan Kotecha: Head of Symbiosis Institute of Technology, Head of Symbiosis Centre for Applied Artificial Intelligence (SCAAI), Pune University
Dr Rahee Walambe: Associate Professor and Faculty at Symbiosis Centre of Applied AI. She is also Faculty at Symbiosis Institute of Technology, Dept of ENTC.
22 January: Nadine Aburumman, Brunel University London.
11 Dec 2019: Arianna Dagliati, Manchester University
Temporal phenotyping for precision medicine
- Arianna is a Research Fellow in the Manchester Molecular Pathology Innovation Centre and the Division of Informatics, Imaging & Data Sciences, University of Manchester. Her background is in Bioengineering and Bioinformatics, with broad experience in applying machine learning approaches to knowledge discovery, predictive modelling, temporal and process mining, and software engineering. At the Manchester Molecular Pathology Innovation Centre her research is dedicated to the discovery of novel biomarkers in autoimmune diseases, based on the integration of clinical and multi-omic data. She develops novel analysis pipelines for Precision Medicine: analytical approaches for identifying temporal patterns and electronic phenotypes in longitudinal clinical data, and for their exploitation in clinical decision support. Together with a multi-disciplinary team of researchers, developers and clinicians, her research aims at understanding how under-used health data can be re-purposed to improve health. Working on different research projects, she combines informatics technology with statistical models to answer scientific questions using data derived from EHRs, cohort studies and public data resources. In the past she collaborated with the Harvard Medical School Informatics for Integrating Biology and the Bedside team for its first implementation in oncologic care in Europe, and with the Department of Biostatistics and Epidemiology at the University of Pennsylvania on the development of novel careflow mining approaches enabling the recognition of temporal patterns and electronic phenotypes in longitudinal clinical data.
Abstract: A key trend in current medical research is a shift from one-size-fits-all to precision treatment strategies, where the focus is on identifying narrow subgroups of the population who would benefit from a given intervention. Precision medicine greatly benefits from algorithms and accessible tools that clinicians can use to identify such subgroups and to generate novel inferences about the patient population they are treating. The complexity and variability of patients’ trajectories in response to treatment pose significant challenges in many medical fields, especially those requiring long-term care. Longitudinal analytics methods, their exploitation in the context of clinical decisions, and their translation into clinical practice through accessible tools represent a potential route to enabling precision healthcare.
The seminar will discuss challenges for Precision Medicine and approaches to exploit longitudinal data for subgroup discovery in Rheumatoid Arthritis and to identify trajectories representing different temporal phenotypes in Type 2 Diabetes.
11 Dec 2019: Lucia Sacchi, University of Pavia, Italy
Lucia Sacchi is an Associate Professor at the Department of Electrical, Computer and Biomedical Engineering at the University of Pavia, Italy. She holds a Master’s Degree in Computer Engineering and a PhD in Bioengineering and Bioinformatics, both from the University of Pavia. She was a post-doctoral fellow at the University of Pavia, a Senior Research Fellow at Brunel University London (UK), and an Assistant Professor at the University of Pavia. Her research interests relate to data mining, with particular focus on temporal data, clinical decision support systems, process mining, and technologies for biomedical data analysis.
She is the Chair of the IMIA working group on Data Mining and Big Data Analytics, vice-chair of the board of the Artificial Intelligence in Medicine (AIME) Society, and a member of the board of the Italian Society of Biomedical Informatics (SIBIM). She is part of the Editorial Boards of BMC Medical Informatics and Decision Making, Artificial Intelligence in Medicine, and the Journal of Biomedical Informatics (JBI), and she is an Academic Editor for PLOS ONE. She has co-authored more than 90 peer-reviewed publications in international journals and conferences.
Abstract: The increasing availability of time-dependent health-related data, both collected in Hospital Information Systems during clinical practice and by patients who use wearable monitoring devices, offers some interesting research challenges. Among these, enriching clinical decision support systems with advanced tools for the analysis of longitudinal data is of paramount importance. Such tools can be useful to synthesise patients’ conditions between encounters, to identify critical situations in advance, or to study temporal trajectories of chronic disease evolution in order to plan timely, targeted interventions. This talk will introduce the problem of analysing temporal data coming from different sources, and will describe some methodologies that can be useful for analysing heterogeneous data. Moreover, it will present some examples of how such analysis has been integrated into real-world clinical decision support systems.
6th Nov 2019: In-house speakers:
Gabriele Scali: Constraint Satisfaction Problems and Constraint Programming
Leila Yousefi: The Prevalence of Errors in Machine Learning Experiments
4th Oct 2019: Jaakko Hollmén, Stockholm University, Department of Computer and Systems Sciences, Sweden
Diagnostic prediction in neonatal intensive care units
- Jaakko Hollmén is a faculty member at the Department of Computer and Systems Sciences at Stockholm University in Sweden (since September 2019). Prior to joining Stockholm University, he was a faculty member at the Department of Computer Science at Aalto University in Finland. His research interests include the theory and practice of machine learning and data mining, in particular in the context of health, medicine and environmental sciences. He has been involved in the organization of many IDA conferences over the past ten years. He is also the secretary of the IDA council.
Abstract: Preterm infants, born before 37 weeks of gestation, are subject to many developmental issues and health problems. Very Low Birth Weight (VLBW) infants, with a birth weight under 1500 g, are the most afflicted in this group. These infants require treatment in the neonatal intensive care unit before they are mature enough for hospital discharge. The neonatal intensive care unit is a data-intensive environment, where multi-channel physiological data is gathered from patients using a number of sensors to construct a comprehensive picture of the patients’ vital signs. We have looked into the problem of how to predict neonatal in-hospital mortality and morbidities, using time series data collected from Very Low Birth Weight infants treated in the neonatal intensive care unit of Helsinki University Hospital between 1999 and 2013. Our results show that machine learning models based on time series data alone have predictive power comparable with standard medical scores, and that combining the two results in improved predictive ability. We have also studied the effect of observer bias on the recording of vital sign measurements in the neonatal intensive care unit, and conducted a retrospective cohort study on trends in the growth of Extremely Low Birth Weight (birth weight under 1000 g) infants during intensive care.
May 15th: John Holmes, University of Pennsylvania
Explainable AI for the (Not-Always-Expert) Clinical Researcher
- John H. Holmes, PhD, is Professor of Medical Informatics in Epidemiology at the University of Pennsylvania Perelman School of Medicine. He is the Associate Director of the Institute for Biomedical Informatics, Director of the Master’s Program in Biomedical Informatics, and Chair of the Doctoral Program in Epidemiology, all at Penn. Dr. Holmes has been recognized nationally and internationally for his work on developing and applying new approaches to mining epidemiologic surveillance data, as well as for his efforts at furthering educational initiatives in clinical research. Dr. Holmes’ research interests are focused on the intersection of medical informatics and clinical research, specifically evolutionary computation and machine learning approaches to knowledge discovery in clinical databases, deep electronic phenotyping, interoperable information systems infrastructures for epidemiologic surveillance, and their application to a broad array of clinical domains, including cardiology and pulmonary medicine. He has collaborated as the informatics lead on an Agency for Healthcare Research and Quality-funded project at Harvard Medical School to establish a scalable distributed research network, and he has served as co-lead of the Governance Core for the SPAN project, a scalable distributed research network; he participates in the FDA Sentinel Initiative. Dr. Holmes has served as the evaluator for the PCORNet Obesity Initiative studies, where he was responsible for developing and implementing the evaluation plan and metrics for the initiative. Dr. Holmes is or has been a principal or co-investigator on projects funded by the National Cancer Institute, the National Library of Medicine, and the Agency for Healthcare Research and Quality, and he was the Penn principal investigator of the NIH-funded Penn Center of Excellence in Prostate Cancer Disparities. Dr. Holmes is engaged with the Botswana-UPenn Partnership, assisting in building informatics education and clinical research capacity in Botswana. Dr. Holmes is an elected Fellow of the American College of Medical Informatics (ACMI), the American College of Epidemiology (ACE), and the International Academy of Health Sciences Informatics (IAHSI).
Abstract: Armed with a well-founded research question, the clinical researcher’s next step is usually to seek out the data that could help answer it, although the researcher can also use data to discover a new research question. In both cases the data will already be available, so either approach to inquiry can be appropriate and justifiable. However, the next steps (data preparation, analytics, and inference) are often thorny issues that even the most seasoned researcher must address, and sometimes not so easily. Traditional approaches to data preparation, which include such methods as frequency distribution and contingency table analyses to characterize the data, are themselves open to considerable investigator bias. In addition, there is considerable tedium in applying these methods: for example, how many contingency tables does it take to identify variable interactions? It is arguable that feature selection and construction are two tasks not to be left only to human interpretation. Yet we don’t see much in the way of novel approaches to “experiencing” data such that new, data-driven insights arise during the data preparation process. The same can be said for analysis, where even state-of-the-art statistical methods, informed or driven by pre-formed hypotheses and the results of feature selection processes, sometimes hamper truly novel knowledge discovery. As a result, inferences made from these analyses likewise suffer. The uninitiated shy away from AI for its very lack of explainability, so new approaches to making AI explainable to users, in this case clinical researchers who do not have the time or inclination to develop a deep understanding of how a given AI algorithm works, are critically important, and their dearth represents a gap that those of us in clinical research informatics need to fill.
This talk will explore some new methods for making AI explainable, one of which, PennAI, has been developed at the University of Pennsylvania. PennAI will be demonstrated using several sample datasets.
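The tedium the abstract alludes to is easy to quantify: exhaustively screening interactions among n candidate variables needs one contingency table per variable subset, and the count grows combinatorially. A quick back-of-the-envelope sketch (illustrative only; not part of PennAI):

```python
from math import comb

def n_contingency_tables(n_vars: int, order: int = 2) -> int:
    """Number of contingency tables needed to screen every
    interaction among `order` of `n_vars` candidate variables."""
    return comb(n_vars, order)

# pairwise and three-way screening quickly becomes impractical by hand
for n in (10, 50, 200):
    print(n, n_contingency_tables(n), n_contingency_tables(n, order=3))
# → 10 45 120 / 50 1225 19600 / 200 19900 1313400
```

Even at 50 variables, pairwise screening alone means over a thousand tables, which is exactly the kind of rote work that automated feature selection is meant to absorb.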
March 13th: Mario Cannataro, Università degli Studi Magna Graecia di Catanzaro
- Mario Cannataro is a Full Professor of Computer Engineering and Bioinformatics at the University “Magna Graecia” of Catanzaro, Italy. He is the director of the Data Analytics research centre and the chair of the Bioinformatics Laboratory at the University “Magna Graecia” of Catanzaro. His current research interests include bioinformatics, medical informatics, data analytics, and parallel and distributed computing. He is a member of the editorial boards of IEEE/ACM Transactions on Computational Biology and Bioinformatics, Briefings in Bioinformatics, High-Throughput, the Encyclopaedia of Bioinformatics and Computational Biology, and the Encyclopaedia of Systems Biology. He has been guest editor of several special issues on bioinformatics and serves as a programme committee member of several conferences. He has published three books and more than 200 papers in international journals and conference proceedings. Mario Cannataro is a Senior Member of IEEE, ACM and BITS, and a member of the Board of Directors of ACM SIGBio.
Abstract: Recently, several factors have been moving biomedical research towards a (big) data-centred science: (i) the Volume of data in bioinformatics is exploding, especially in healthcare and medicine; (ii) new bioinformatics data is created at increasing Velocity due to advances in experimental platforms and the increased use of IoT (Internet of Things) health monitoring sensors; (iii) the increasing Variety and (iv) Variability of data (omics, clinical, administrative, sensor, and social data are inherently heterogeneous) may lead to wrong modelling, integration and interpretation; and finally (v) the Value of data in bioinformatics is increasing, due both to the cost of the infrastructures needed to produce and analyse the data and to the value of the extracted biomedical knowledge. The emergence of this Big Data trend in bioinformatics poses new challenges for computer science solutions regarding the efficient storage, preprocessing, integration and analysis of omics (e.g. genomics, proteomics, and interactomics) and clinical (e.g. laboratory data, bioimages, pharmacology data, social network data, etc.) data, which have become a main bottleneck of the analysis pipeline. To face those challenges, the main trends are: (i) the use of high-performance computing in all steps of the analysis pipeline, including parallel processing of raw experimental data, parallel analysis of data, and efficient data visualization; (ii) the deployment of data analysis pipelines and main biological databases on the Cloud; (iii) the use of novel data models that combine structured (e.g. relational) and unstructured (e.g. text, multimedia, biosignals, bioimages) data, with a special focus on graph databases; (iv) the development of novel data analytics methods, such as Sentiment Analysis, Affective Computing and Graph Analytics, that integrate traditional statistical and data mining analysis; and (v) particular attention to issues regarding patient privacy, as well as the permitted ways to use and analyze biomedical data.
After recalling the main omics data, the first part of the talk presents some experiences and applications related to the preprocessing and data mining analysis of omics, clinical and social data conducted at the University Magna Graecia of Catanzaro. Some case studies in the oncology (pharmacogenomics data) and paediatrics (sentiment analysis) domains are also presented. With the availability of large datasets, Deep Learning algorithms have proved to achieve state-of-the-art performance on many different problems, for example in text classification. However, deep models have the drawback of not being human-interpretable, raising various problems related to model interpretability. Model interpretability is another important aspect to be considered in order to develop a Clinical Decision Support System (CDSS) that clinicians can trust. In particular, an interpretable CDSS can ensure that: i) clinicians understand the system’s predictions (in the sense that predictions are required to be consistent with medical knowledge); ii) the decisions will not negatively affect the patient; iii) the decisions are ethical; iv) the system is optimized on complete objectives; and v) the system is accurate and sensitive patient data are protected. There is therefore a need for new strategies for developing explainable AI systems that support medical decisions and, in particular, present human-understandable explanations to clinicians, strategies that can also take into account sentiment analysis or, more generally, explainable text classification methodologies. Recently, the deep network architecture called Capsule Networks has gained a lot of interest, showing intrinsic properties that can potentially improve model explainability in image recognition. However, to the best of our knowledge, whether Capsule Networks can improve explainability for text classification problems is a question that needs further investigation.
The second part of this talk will give a brief overview of proposed explainable models and then discuss how Capsule Networks can be adapted to sentiment classification problems in order to improve explainability.
January 16th: Norman Fenton, Queen Mary University of London
Norman Fenton is Professor of Risk Information Management at Queen Mary University of London and a Director of Agena, a company that specialises in risk management for critical systems. Norman is a mathematician by training whose current research focuses on critical decision-making and, in particular, on quantifying uncertainty using a ‘smart data’ approach that combines data with expert judgment. Applications include law and forensics (Norman has been an expert witness in major criminal and civil cases), health, security, software reliability, transport safety and reliability, finance, and football prediction. Norman has been PI on grants totalling over £10 million. He currently leads an EPSRC Digital Health Technologies project (PAMBAYESIAN) and a Leverhulme Trust grant (CAUSAL-DYNAMICS). In 2014 Norman was awarded a prestigious European Research Council Advanced Grant (BAYES-KNOWLEDGE), in which the ‘smart data’ approach evolved. Since June 2011 he has led an international consortium (Bayes and the Law) of statisticians, lawyers and forensic scientists working to improve the use of statistics in court. In 2016 he led a prestigious six-month Programme on Probability and Statistics in Forensic Science at the Isaac Newton Institute for Mathematical Sciences, University of Cambridge, where he was also a Simons Fellow. He was appointed a Fellow of The Turing Institute in 2018. In March 2015 Norman presented the award-winning BBC documentary Climate Change by Numbers.
Abstract: Misunderstandings about risk, statistics and probability often lead to flawed decision-making in many critical areas such as medicine, finance, law, defence, and transport. The ‘big data’ revolution was intended to at least partly address these concerns by removing reliance on subjective judgments. However, even where (relevant) big data are available, there are fundamental limitations to what can be achieved through pure machine learning techniques. This talk will explain the successes and challenges of using causal probabilistic models of risk, based on a technique called Bayesian networks, to provide powerful decision support and accurate predictions through a ‘smart data’ approach that combines minimal data with expert judgment. The talk will provide examples in chronic diseases, forensics, terrorist threat analysis, and even sports betting.
- December 12th 2018: Pearse Keane, Moorfields Eye Hospital, “Artificial Intelligence in Ophthalmology”
Pearse A. Keane, MD, FRCOphth, is a consultant ophthalmologist at Moorfields Eye Hospital, London and an NIHR Clinician Scientist based at the Institute of Ophthalmology, University College London (UCL). Dr Keane specialises in applied ophthalmic research, with a particular interest in retinal imaging and new technologies. In April 2015, he was ranked no. 4 in a worldwide ranking of ophthalmologists under 40 published in “The Ophthalmologist” journal (https://theophthalmologist.com/the-power-list-2015/). In 2016, he initiated a formal collaboration between Moorfields Eye Hospital and Google DeepMind, with the aim of applying machine learning to the automated diagnosis of optical coherence tomography (OCT) images. In August 2018, the first results of this collaboration were published in the journal Nature Medicine.
The Moorfields-DeepMind Collaboration – Reinventing the Eye Examination
Ophthalmology is among the most technology-driven of all the medical specialties, with treatments utilizing high-spec medical lasers and advanced microsurgical techniques, and diagnostics involving ultra-high-resolution imaging. Ophthalmology is also at the forefront of many trailblazing research areas in healthcare, such as stem cell therapy, gene therapy, and – most recently – artificial intelligence. In July 2016, Moorfields announced a formal collaboration with the world’s leading artificial intelligence company, DeepMind. This collaboration involves the sharing of >1,000,000 anonymised retinal scans with DeepMind to allow for the automated diagnosis of diseases such as age-related macular degeneration (AMD) and diabetic retinopathy (DR).
In my presentation, I will describe the motivation – and urgent need – to apply deep learning to ophthalmology, the processes required to establish a research collaboration between the NHS and a company like DeepMind, the initial results of our research, and finally, why I believe that ophthalmology could be the first branch of medicine to be fundamentally reinvented through the application of artificial intelligence.
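The ‘smart data’ idea in Fenton’s abstract, a Bayesian network combining an expert-elicited prior with observed evidence, can be sketched with a minimal two-node network (Disease → Test). The prevalence, sensitivity and specificity below are hypothetical illustration values, not figures from the talk:

```python
def posterior_given_positive(prior: float, sensitivity: float, specificity: float) -> float:
    """Bayes' rule on a two-node network Disease -> Test:
    returns P(disease | positive test)."""
    # total probability of a positive test: true positives + false positives
    p_positive = sensitivity * prior + (1 - specificity) * (1 - prior)
    return sensitivity * prior / p_positive

# expert judgment supplies the prior; the test's error rates supply the CPT
print(round(posterior_given_positive(prior=0.01, sensitivity=0.90, specificity=0.95), 3))
# → 0.154: even a good test leaves substantial uncertainty at low prevalence
```

Real Bayesian-network tools (e.g. AgenaRisk, developed by Fenton’s company Agena) propagate evidence through networks of many such nodes; this sketch shows only the underlying arithmetic for a single edge.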
Niels Peek is Professor of Health Informatics and Strategic Research Domain Director for Digital Health at the University of Manchester. He has a background in Computer Science and Artificial Intelligence, and his research focuses on data-driven methods for health research, healthcare quality improvement, and computerised decision support. From 2013 to 2017 he was the President of the Society for Artificial Intelligence in Medicine (AIME). He is a member of the editorial boards of the Journal of the American Medical Informatics Association and the Artificial Intelligence in Medicine journal. In April 2017, he organised the Informatics for Health 2017 conference in Manchester, which was attended by more than 800 people from 30 countries. He also co-chaired the Scientific Programme Committee of MEDINFO-2017, the 16th World Congress on Health and Biomedical Informatics, which was held in Hangzhou, China, in August 2017. In 2018 he was elected a Fellow of the American College of Medical Informatics and a Fellow of the Alan Turing Institute.
My talk will introduce the concept of “Learning Health Systems” and focus on the role of clinical prediction models within these systems. Building on the distinction between explanatory and predictive models (which is commonly made in statistics and epidemiology, but not in computer science), I will review the use of machine learning and statistical modelling in healthcare; discuss the role of model interpretation and transparency in explanatory and predictive models; and discuss the suitability of different analytical methods for facilitating interpretability and transparency.
- October 17th: Allan Tucker, “Opening the Black Box“, Brunel University London
- July 2018: “Temporal Information Extraction from Clinical Narratives”
Natalia Viani, King’s College London
Electronic health records represent a great source of valuable information for both patient care and biomedical research. Despite the efforts put into collecting structured data, a lot of information is available only in the form of free text. For this reason, developing natural language processing (NLP) systems that identify clinically relevant concepts (e.g., symptoms, medication) is essential. Moreover, contextualizing these concepts from the temporal point of view is an important further step.
Over the past years, many NLP systems have been developed to process clinical texts written in English and belonging to specific medical domains (e.g., intensive care, oncology). However, research on other languages and domains is still limited. During my PhD, I applied information extraction techniques to the analysis of medical reports written in Italian, with a focus on the cardiology domain. In particular, I explored different methods for extracting clinical events and their attributes, as well as temporal expressions.
At the moment, I am working on the analysis of mental health records for patients with a diagnosis of schizophrenia, with the aim of automatically identifying symptom onset information from clinical notes.
Dr Viani is a postdoctoral research associate at the Department of Psychological Medicine, NIHR Biomedical Research Centre, Institute of Psychiatry, Psychology and Neuroscience, King’s College London. She received her PhD in Bioengineering and Bioinformatics from the Department of Electrical, Computer and Biomedical Engineering, University of Pavia, in January 2018. During her PhD, she spent six months as a visiting research scholar in the Natural Language Processing Laboratory at the Computational Health Informatics Program at Boston Children’s Hospital – Harvard Medical School.
Her research interests are natural language processing, clinical and temporal information extraction, and biomedical informatics. She is especially interested in the reconstruction of clinical timelines from free text.
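To give a flavour of what temporal expression extraction involves, here is a deliberately minimal, hypothetical pattern for spotting a few English temporal expressions in a clinical-style note. Production systems (e.g. HeidelTime or SUTime) use far richer grammars and also normalise expressions to calendar dates:

```python
import re

# toy pattern covering three common temporal-expression shapes
TEMPORAL = re.compile(
    r"\b("
    r"\d{1,2}/\d{1,2}/\d{2,4}"                 # absolute dates: 12/03/2017
    r"|\d+\s+(?:day|week|month|year)s?\s+ago"  # relative offsets: 3 days ago
    r"|yesterday|today|last\s+\w+"             # deictic expressions
    r")\b",
    re.IGNORECASE,
)

note = "Patient reported chest pain 3 days ago; previous admission on 12/03/2017."
print(TEMPORAL.findall(note))
# → ['3 days ago', '12/03/2017']
```

Anchoring each extracted expression to a timeline (e.g. relative to the note’s creation date) is the normalisation step that makes such output clinically useful.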
- March 2018: “Optimal Low-dimensional Projections for Spectral Clustering”
Nicos Pavlidis, Lancaster University
The presentation will discuss the problem of determining the optimal low-dimensional projection for maximising the separability of a binary partition of an unlabelled dataset, as measured by spectral graph theory. This is achieved by finding projections which minimise the second eigenvalue of the graph Laplacian of the projected data, which corresponds to a non-convex, non-smooth optimisation problem. It can be shown that, as the scaling parameter is reduced to zero, the optimal univariate projection based on spectral connectivity converges to the vector normal to the maximum-margin hyperplane through the data. This establishes a connection between connectivity as measured by spectral graph theory and maximal Euclidean separation.
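The quantity being minimised can be sketched under assumed choices (a Gaussian similarity kernel and the unnormalised Laplacian; the talk’s exact formulation may differ): project the data onto a direction, build a similarity graph on the projected points, and read off the second-smallest Laplacian eigenvalue. A direction that separates the clusters yields a much smaller value than one that mixes them:

```python
import numpy as np

def spectral_connectivity(X: np.ndarray, v: np.ndarray, scale: float = 1.0) -> float:
    """Second-smallest eigenvalue of the (unnormalised) graph Laplacian
    built from the univariate projection of X onto direction v."""
    p = X @ (v / np.linalg.norm(v))                        # 1-D projection
    W = np.exp(-((p[:, None] - p[None, :]) ** 2) / (2 * scale ** 2))
    np.fill_diagonal(W, 0.0)                               # no self-loops
    L = np.diag(W.sum(axis=1)) - W                         # graph Laplacian
    return float(np.linalg.eigvalsh(L)[1])                 # lambda_2

# two clusters separated only along the first coordinate axis
rng = np.random.default_rng(0)
X = rng.normal(0.0, 0.5, size=(40, 2))
X[:20, 0] -= 3.0
X[20:, 0] += 3.0

lam_sep = spectral_connectivity(X, np.array([1.0, 0.0]))   # separating direction
lam_mix = spectral_connectivity(X, np.array([0.0, 1.0]))   # mixing direction
print(lam_sep < lam_mix)
# → True: the separating projection has far lower spectral connectivity
```

Minimising `lam_sep` over all directions `v` is the (non-convex, non-smooth) optimisation problem the talk addresses.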
- December 2016: “The value of evaluation: towards trustworthy machine learning”
Peter Flach, University of Bristol
Machine learning, broadly defined as data-driven technology to enhance human decision making, is already in widespread use and will soon be ubiquitous and indispensable in all areas of human endeavour. Data is collected routinely in all areas of significant societal relevance, including law, policy, national security, education and healthcare, and machine learning informs decision making by detecting patterns in the data. Achieving transparency, robustness and trustworthiness of these machine learning applications is hence of paramount importance, and evaluation procedures and metrics play a key role in this.
In this talk I will review current issues in the theory and practice of evaluating predictive machine learning models. Many issues arise from a limited appreciation of the importance of the scale on which metrics are expressed. I will discuss why it is OK to use the arithmetic average for aggregating accuracies achieved over different test sets, but not for aggregating F-scores. I will also discuss why it is OK to use logistic scaling to calibrate the scores of a support vector machine, but not to calibrate naive Bayes. More generally, I will discuss the need for a dedicated measurement theory for machine learning that would use latent-variable models, such as item-response theory from psychometrics, to estimate latent skills and capabilities from observable traits.
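The F-score point can be reproduced with two hypothetical test sets of very different size and balance: the arithmetic mean of the per-set F1 scores disagrees markedly with the F1 computed from the pooled confusion counts, because F1 is a ratio and does not live on an additive scale:

```python
def f1(tp: int, fp: int, fn: int) -> float:
    """F1 score from confusion counts."""
    return 2 * tp / (2 * tp + fp + fn)

# (tp, fp, fn) on two hypothetical test sets: one large and easy, one tiny and hard
folds = [(90, 10, 10), (1, 0, 9)]

mean_f1 = sum(f1(*fold) for fold in folds) / len(folds)   # arithmetic average
pooled_f1 = f1(*map(sum, zip(*folds)))                    # F1 on pooled counts

print(round(mean_f1, 3), round(pooled_f1, 3))
# → 0.541 0.863 — the two aggregates tell very different stories
```

For accuracy the two aggregation routes coincide (when test sets are equally sized), which is why averaging accuracies is unproblematic while averaging F-scores is not.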
- October 2016: “On Models, Patterns, and Prediction”
Jaakko Hollmén, Aalto University, Helsinki
Pattern discovery has been the center of attention of data mining research for a long time, with pattern languages varying from simple to complex according to the needs of the applications and the format of the data. In this talk, I will take a view on pattern mining that combines elements from neighboring areas. More specifically, I will describe our previous research work at the intersection of three areas: probabilistic modeling, pattern mining and predictive modeling. Clustering in the context of pattern mining will be explored, as well as linguistic summarization patterns. Multiresolution pattern mining, semantic pattern discovery and pattern visualization will also be visited. Time allowing, I will speak about patterns of missing data and their implications for predictive modeling.
Jaakko Hollmén is a faculty member at the Department of Computer Science at Aalto University in Espoo, Finland. He received his doctoral degree with distinction in 2000. His research interests include data analysis, machine learning and data mining, with applications in health and environmental informatics. He has chaired several conferences in his areas of interest, including IDA, DS, and IEEE Computer-Based Medical Systems. Currently, he is co-chair of the Program Committee of ECML PKDD 2017, which is organized in Skopje, Macedonia, during September 19-23, 2017. His publications can be found at: https://users.ics.aalto.fi/jhollmen/Publications/
- May 2016: “Beyond Clinical Data Mining: Electronic Phenotyping for Research Cohort Identification”
John Holmes, University of Pennsylvania
The availability of ever-increasing amounts of highly heterogeneous clinical data poses both opportunities and challenges for the data scientist and clinical researcher. Electronic medical records are more prevalent than ever, and now we see that other data sources contribute greatly to the clinical research enterprise. These sources provide genetic, image, and environmental data, to name just three. Now it is possible to investigate the effects of the built environment, such as the availability of food markets, sidewalks, and playgrounds, coupled with clinical observations noted in the process of providing patient care, along with identified genetic variants that could predispose one to diabetes mellitus. Furthermore, these data could be used in a truly integrated sense to manage such patients more effectively than relying solely on the traditional medical record. The opportunity for enhanced clinical research is manifest in this expanding data and information ecosystem. The challenges are more subtly detected, but present nonetheless. Merging these heterogeneous data into an analyzable whole depends on the availability of a robust unique identifier that has yet to be created, at least in the US. As a result, researchers have developed various probabilistic methods of record matching, occasionally at the expense of data privacy and confidentiality. Another challenge is the sheer heterogeneity of the data; it is not easy to understand the clinical context of an image or waveform without its semantic integration with clinical observation data. In addition, there is the problem of ecologic fallacy, which arises from using data that have no real connection to a clinical record in the service of proposing or testing hypotheses.
This problem is quite evident when coupling environmental and clinical data: just because there is a well-stocked market with a surfeit of inexpensive, healthy food options in a person’s neighborhood doesn’t mean that that person avails herself of these items. Finally, there is the problem of data quality. Much of the data we use, whether collected by us or obtained from another source, is replete with problems, such as missingness, contradictions, and errors in representation. We will explore in detail the opportunities and challenges posed to informatics and clinical researchers as they are faced with these seemingly endless sources of data. We will also discuss novel approaches to mining these complex, heterogeneous data for the purpose of constructing cohorts for research.
John Holmes is Professor of Medical Informatics in Epidemiology at the University of Pennsylvania Perelman School of Medicine. He is the Associate Director of the Penn Institute for Biomedical Informatics and is Chair of the Graduate Group in Epidemiology and Biostatistics. Dr. Holmes’ research interests are focused on several areas in medical informatics, including evolutionary computation and machine learning approaches to knowledge discovery in clinical databases (data mining), interoperable information systems infrastructures for epidemiologic surveillance, regulatory science as it applies to health information and information systems, clinical decision support systems, semantic analysis, shared decision making and patient-physician communication, and information systems user behavior. Dr. Holmes is a principal or co-investigator on projects funded by the National Cancer Institute, the Patient-Centered Outcomes Research Institute, the National Library of Medicine, and the Agency for Healthcare Research and Quality, and he is the principal investigator of the NIH-funded Penn Center of Excellence in Prostate Cancer Disparities. Dr. Holmes is engaged with the Botswana-UPenn Partnership, assisting in building informatics education and clinical research capacity in Botswana. He leads the evaluation of the National Obesity Observational Studies of the Patient-Centered Clinical Research Network. Dr. Holmes is an elected Fellow of the American College of Medical Informatics and the American College of Epidemiology.
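The probabilistic record matching mentioned in the abstract can be illustrated in the Fellegi–Sunter style: each field contributes a log-likelihood-ratio weight depending on whether it agrees across the two records, using per-field agreement probabilities for true matches (m) and non-matches (u). The field names and probabilities below are hypothetical illustration values:

```python
from math import log

# hypothetical (m, u) pairs: P(field agrees | same person) and P(field agrees | different people)
FIELDS = {"surname": (0.95, 0.01), "birth_year": (0.98, 0.05), "zip": (0.90, 0.10)}

def match_score(rec_a: dict, rec_b: dict) -> float:
    """Sum of Fellegi-Sunter log-likelihood-ratio weights over the fields."""
    score = 0.0
    for field, (m, u) in FIELDS.items():
        if rec_a[field] == rec_b[field]:
            score += log(m / u)              # agreement weight (positive)
        else:
            score += log((1 - m) / (1 - u))  # disagreement weight (negative)
    return score

a = {"surname": "smith", "birth_year": 1970, "zip": "19104"}
b = {"surname": "smith", "birth_year": 1970, "zip": "19103"}
print(match_score(a, b) > 0)
# → True: two strong agreements outweigh one ZIP mismatch
```

In practice, records scoring above an upper threshold are declared matches and those below a lower threshold non-matches, with the interval in between sent for clerical review.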