Peter Flach, University of Bristol
Machine learning, broadly defined as data-driven technology to enhance human decision making, is already in widespread use and will soon be ubiquitous and indispensable in all areas of human endeavour. Data is collected routinely in all areas of significant societal relevance including law, policy, national security, education and healthcare, and machine learning informs decision making by detecting patterns in the data. Achieving transparency, robustness and trustworthiness of these machine learning applications is hence of paramount importance, and evaluation procedures and metrics play a key role in this.
In this talk I will review current issues in theory and practice of evaluating predictive machine learning models. Many issues arise from a limited appreciation of the importance of the scale on which metrics are expressed. I will discuss why it is OK to use the arithmetic average for aggregating accuracies achieved over different test sets but not for aggregating F-scores. I will also discuss why it is OK to use logistic scaling to calibrate the scores of a support vector machine but not to calibrate naive Bayes. More generally, I will discuss the need for a dedicated measurement theory for machine learning that would use latent-variable models such as item-response theory from psychometrics in order to estimate latent skills and capabilities from observable traits.
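The point about averaging can be illustrated with a small numerical sketch. It is one way to see the issue, not necessarily the argument made in the talk. The confusion-matrix counts are invented for illustration, and the names `Counts`, `accuracy`, `f1` and `pool` are hypothetical helpers. With equally sized test sets, the arithmetic mean of the accuracies equals the accuracy on the pooled data, whereas the arithmetic mean of the F1-scores does not equal the pooled F1-score.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class Counts:
    """Confusion-matrix counts for one test set (illustrative values only)."""
    tp: int
    fp: int
    fn: int
    tn: int

    @property
    def n(self) -> int:
        return self.tp + self.fp + self.fn + self.tn


def accuracy(c: Counts) -> float:
    return (c.tp + c.tn) / c.n


def f1(c: Counts) -> float:
    # F1 = 2TP / (2TP + FP + FN); assumes at least one TP, FP or FN.
    return 2 * c.tp / (2 * c.tp + c.fp + c.fn)


def pool(sets: list[Counts]) -> Counts:
    """Merge several test sets by summing their counts."""
    return Counts(
        tp=sum(c.tp for c in sets),
        fp=sum(c.fp for c in sets),
        fn=sum(c.fn for c in sets),
        tn=sum(c.tn for c in sets),
    )


# Two hypothetical test sets of 100 instances each.
test_sets = [
    Counts(tp=40, fp=10, fn=10, tn=40),
    Counts(tp=5, fp=5, fn=25, tn=65),
]
pooled = pool(test_sets)

mean_acc = mean(accuracy(c) for c in test_sets)
pooled_acc = accuracy(pooled)
mean_f1 = mean(f1(c) for c in test_sets)
pooled_f1 = f1(pooled)

print(f"accuracy: mean of sets = {mean_acc:.3f}, pooled = {pooled_acc:.3f}")
print(f"F1:       mean of sets = {mean_f1:.3f}, pooled = {pooled_f1:.3f}")
```

In this example the two accuracy figures agree at 0.75, while the mean F1-score is 0.525 against a pooled F1-score of about 0.643. The size of the gap depends on the invented counts. The accuracy agreement itself relies on the test sets being the same size.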
Jaakko Hollmén, Aalto University, Helsinki
Pattern discovery has been the center of attention of data mining research for a long time, with pattern languages varying from simple to complex, according to the needs of the applications and the format of data. In this talk, I will take a view on pattern mining that combines elements from neighboring areas. More specifically, I will describe our previous research work in the intersection of three areas: probabilistic modeling, pattern mining and predictive modeling. Clustering in the context of pattern mining will be explored, as well as linguistic summarization patterns. Also, multiresolution pattern mining as well as semantic pattern discovery and pattern visualization will be visited. Time allowing, I will speak about patterns of missing data and their implications for predictive modeling.
Jaakko Hollmén is a faculty member at the Department of Computer Science at Aalto University in Espoo, Finland. He received his doctoral degree with distinction in 2000. His research interests include data analysis, machine learning and data mining, with applications in health and in environmental informatics. He has chaired several conferences in his areas of interest, including IDA, DS, and IEEE Computer-Based Medical Systems. Currently, he is co-chair of the Program Committee of ECML PKDD 2017, which is organized in Skopje, Macedonia during September 19-23, 2017. His publications can be found at: https://users.ics.aalto.fi/jhollmen/Publications/
John Holmes, University of Pennsylvania
The availability of ever-increasing amounts of highly heterogeneous clinical data poses both opportunities and challenges for the data scientist and clinical researcher. Electronic medical records are more prevalent than ever, and now we see that other data sources contribute greatly to the clinical research enterprise. These sources provide genetic, image, and environmental data, just to name three. Now, it is possible to investigate the effects of the built environment, such as the availability of food markets, sidewalks, and playgrounds, coupled with clinical observations noted in the process of providing patient care, along with identified genetic variants that could predispose one to diabetes mellitus. Furthermore, these data could be used in a truly integrated sense to manage such patients more effectively than relying solely on the traditional medical record. The opportunity for enhanced clinical research is manifest in this expanding data and information ecosystem. The challenges are more subtly detected, but present nonetheless. Merging these heterogeneous data into an analyzable whole depends on the availability of a robust unique identifier that has yet to be created, at least in the US. As a result, researchers have developed various probabilistic methods of record matching, occasionally at the expense of data privacy and confidentiality. Another challenge is the sheer heterogeneity of the data; it is not easy to understand the clinical context of an image or waveform without its semantic integration with clinical observation data. In addition, there is the problem of ecologic fallacy, which arises from using data that have no real connection to a clinical record in the service of proposing or testing hypotheses.
This problem is quite evident when coupling environmental and clinical data: just because there is a well-stocked market with a surfeit of inexpensive, healthy food options in a person’s neighborhood doesn’t mean that that person avails herself of these items. Finally, there is the problem of data quality. Much of the data we use, whether collected by us or obtained from another source, is replete with problems, such as missingness, contradictions, and errors in representation. We will explore in detail the opportunities and challenges posed to informatics and clinical researchers as they are faced with these seemingly endless sources of data. We will also discuss novel approaches to mining these complex, heterogeneous data for the purpose of constructing cohorts for research.
John Holmes is Professor of Medical Informatics in Epidemiology at the University of Pennsylvania Perelman School of Medicine. He is the Associate Director of the Penn Institute for Biomedical Informatics and is Chair of the Graduate Group in Epidemiology and Biostatistics. Dr. Holmes’ research interests are focused on several areas in medical informatics, including evolutionary computation and machine learning approaches to knowledge discovery in clinical databases (data mining), interoperable information systems infrastructures for epidemiologic surveillance, regulatory science as it applies to health information and information systems, clinical decision support systems, semantic analysis, shared decision making and patient-physician communication, and information systems user behavior. Dr. Holmes is a principal or co-investigator on projects funded by the National Cancer Institute, the Patient-Centered Outcomes Research Institute, the National Library of Medicine, and the Agency for Healthcare Research and Quality, and he is the principal investigator of the NIH-funded Penn Center of Excellence in Prostate Cancer Disparities. Dr. Holmes is engaged with the Botswana-UPenn Partnership, assisting in building informatics education and clinical research capacity in Botswana. He leads the evaluation of the National Obesity Observational Studies of the Patient-Centered Clinical Research Network. Dr. Holmes is an elected Fellow of the American College of Medical Informatics and the American College of Epidemiology.