Opening the Black Box (2018-2019)
Both the commercial and academic sectors are exploring the use of their state-of-the-art algorithms to make important decisions, for example in healthcare. These algorithms exploit a heterogeneous mix of on-body sensor data, clinical test results, socio-economic information, and digitised electronic health records. A major issue is that many of the algorithms on offer are black box in nature (defined as a system which can be viewed in terms of its inputs and outputs without any knowledge of its internal workings). This is because the algorithms are often extremely complex with many parameters (such as deep learning) and also because the algorithms themselves are now valuable commodities. Not knowing the underlying mechanisms of these black box systems is a problem for two important reasons. Firstly, if the predictive models are not transparent and explainable, we lose the trust of experts such as healthcare practitioners. Secondly, without access to the knowledge of how an algorithm works we cannot truly understand the underlying meaning of the output. These problems need to be addressed if we are to gain new insights from data on disease and health.
This seminar series will include talks from experts from statistics, computer science, psychology, and health / medicine in order to address this issue head-on. The seminar will focus on building a network of experts in state-of-the-art technologies that exploit the huge data resources available, while ensuring that these systems are explainable to domain experts. This will result in systems that not only generate new insights but are also more fully trusted.
(All seminars held at 3pm in WLFB 207/208)
22 January: Nadine Aburumman, Brunel University London.
- TBC: Pedro Rodrigues, University of Porto
- TBC: Sandra Wachter, The Oxford Internet Institute
11 Dec 2019: Arianna Dagliati, Manchester University
Temporal phenotyping for precision medicine
- Arianna is a Research Fellow in the Manchester Molecular Pathology Innovation Centre, and the Division of Informatics, Imaging & Data Sciences, University of Manchester. Her background is in Bioengineering and Bioinformatics, with broad experience in applying machine learning approaches to knowledge discovery, predictive modeling, temporal and process mining, and software engineering. At the Manchester Molecular Pathology Innovation Centre her research is dedicated to the discovery of novel biomarkers in autoimmune diseases, based on the integration of clinical and multi-omic data. She develops novel analysis pipelines for Precision Medicine: analytical approaches for identifying temporal patterns and electronic phenotypes in longitudinal clinical data, and for their exploitation in clinical decision support. Together with a multi-disciplinary team of researchers, developers and clinicians, her research is aimed at understanding how under-used health data can be re-purposed to improve health. Working on different research projects, she combines informatics technology with statistical models to answer scientific questions using data derived from EHR, cohort studies and public data resources. In the past she collaborated with the Harvard Medical School, Informatics for Integrating Biology and the Bedside team for its first implementation in oncologic care in Europe and with the Department of Biostatistics and Epidemiology at the University of Pennsylvania for the development of novel careflow mining approaches for enabling the recognition of temporal patterns and electronic phenotypes in longitudinal clinical data.
Abstract: A key trend in current medical research is a shift from one-size-fits-all to precision treatment strategies, where the focus is on identifying narrow subgroups of the population who would benefit from a given intervention. Precision medicine greatly benefits from algorithms and accessible tools that clinicians can use to identify such subgroups, and to generate novel inferences about the patient population they are treating. The complexity and variability of patients’ trajectories in response to treatment pose significant challenges in many medical fields, especially in those requiring long-term care. Longitudinal analytics methods, their exploitation in the context of clinical decisions, and their translation into clinical practice through accessible tools represent a potential for enabling precision healthcare.
The seminar will discuss challenges for Precision Medicine and approaches to exploit longitudinal data for subgroup discovery in Rheumatoid Arthritis and to identify trajectories representing different temporal phenotypes in Type 2 Diabetes.
11 Dec 2019: Lucia Sacchi, University of Pavia, Italy
Lucia Sacchi is Associate Professor at the Department of Electrical, Computer and Biomedical Engineering at the University of Pavia, Italy. She holds a Master’s degree in Computer Engineering and a PhD in Bioengineering and Bioinformatics, both from the University of Pavia. She was a post-doctoral fellow at the University of Pavia, Senior Research Fellow at Brunel University London (UK), and Assistant Professor at the University of Pavia. Her research interests are related to data mining, with particular focus on temporal data, clinical decision support systems, process mining, and technologies for biomedical data analysis.
She is the Chair of the IMIA working group on Data Mining and Big Data Analytics, vice-chair of the board of the Artificial Intelligence in Medicine (AIME) Society, and member of the board of the Italian Society of Biomedical Informatics (SIBIM). She is part of the Editorial Board of BMC Medical Informatics and Decision Making, Artificial Intelligence in Medicine, and Journal of Biomedical Informatics (JBI), and she is Academic Editor for PLOS ONE. She has co-authored more than 90 peer-reviewed scientific publications in international journals and at international conferences.
Abstract The increasing availability of time-dependent health-related data, both collected in Hospital Information Systems during clinical practice, and by patients who use wearable monitoring devices, offers some interesting research challenges. Among these, enriching clinical decision support systems with advanced tools for the analysis of longitudinal data is of paramount importance. Such tools can be useful to synthesise the patients’ conditions in between encounters to identify critical situations in advance, or to study temporal trajectories of chronic disease evolution to plan timely targeted interventions. This talk will introduce the problem of the analysis of temporal data coming from different sources, and will describe some methodologies that can be useful to analyse heterogeneous data. Moreover, it will present some examples on how such analysis has been integrated in real-world clinical decision support systems.
6th Nov 2019: In house speakers:
Gabriele Scali: Constraint Satisfaction Problems and Constraint Programming
Leila Yousefi: The Prevalence of Errors in Machine Learning Experiments
4th Oct 2019: Jaakko Hollmén, Stockholm University, Department of Computer and Systems Sciences, Sweden
Diagnostic prediction in neonatal intensive care units
- Jaakko Hollmén is a faculty member at the Department of Computer and Systems Sciences at Stockholm University in Sweden (since September 2019). Prior to joining Stockholm University, he was a faculty member at the Department of Computer Science at Aalto University in Finland. His research interests include theory and practice of machine learning and data mining, in particular in the context of health, medicine and environmental sciences. He has been involved in the organization of many IDA conferences for the past ten years. He is also the secretary of the IDA council.
Abstract: Preterm infants, born before 37 weeks of gestation, are subject to many developmental issues and health problems. Very Low Birth Weight (VLBW) infants, with a birth weight under 1500 g, are the most afflicted in this group. These infants require treatment in the neonatal intensive care unit before they are mature enough for hospital discharge. The neonatal intensive care unit is a data-intensive environment, where multi-channel physiological data is gathered from patients using a number of sensors to construct a comprehensive picture of the patients’ vital signs. We have looked into the problem of how to predict neonatal in-hospital mortality and morbidities. We have used time series data collected from Very Low Birth Weight infants treated in the neonatal intensive care unit of Helsinki University Hospital between 1999 and 2013. Our results show that machine learning models based on time series data alone have predictive power comparable with standard medical scores, and combining the two results in improved predictive ability. We have also studied the effect of observer bias on recording vital sign measurements in the neonatal intensive care unit, as well as conducted a retrospective cohort study on trends in the growth of Extremely Low Birth Weight (birth weight under 1000 g) infants during intensive care.
May 15th: John Holmes, University of Pennsylvania
Explainable AI for the (Not-Always-Expert) Clinical Researcher
- John H. Holmes, PhD, is Professor of Medical Informatics in Epidemiology at the University of Pennsylvania Perelman School of Medicine. He is the Associate Director of the Institute for Biomedical Informatics, Director of the Master’s Program in Biomedical Informatics, and Chair of the Doctoral Program in Epidemiology, all at Penn. Dr. Holmes has been recognized nationally and internationally for his work on developing and applying new approaches to mining epidemiologic surveillance data, as well as his efforts at furthering educational initiatives in clinical research. Dr. Holmes’ research interests are focused on the intersection of medical informatics and clinical research, specifically evolutionary computation and machine learning approaches to knowledge discovery in clinical databases, deep electronic phenotyping, interoperable information systems infrastructures for epidemiologic surveillance, and their application to a broad array of clinical domains, including cardiology and pulmonary medicine. He has collaborated as the informatics lead on an Agency for Healthcare Research and Quality-funded project at Harvard Medical School to establish a scalable distributed research network, and he has served as the co-lead of the Governance Core for the SPAN project, a scalable distributed research network; he participates in the FDA Sentinel Initiative. Dr. Holmes has served as the evaluator for the PCORNet Obesity Initiative studies, where he was responsible for developing and implementing the evaluation plan and metrics for the initiative. Dr. Holmes is or has been a principal or co-investigator on projects funded by the National Cancer Institute, the National Library of Medicine, and the Agency for Healthcare Research and Quality, and he was the Penn principal Investigator of the NIH-funded Penn Center of Excellence in Prostate Cancer Disparities. Dr. Holmes is engaged with the Botswana-UPenn Partnership, assisting in building informatics education and clinical research capacity in Botswana. Dr. Holmes is an elected Fellow of the American College of Medical Informatics (ACMI), the American College of Epidemiology (ACE), and the International Academy of Health Sciences Informatics (IAHSI).
Abstract: Armed with a well-founded research question, the clinical researcher’s next step is usually to seek out the data that could help answer it, although the researcher can use data to discover a new research question. In both cases, the data will already be available, and so either approach to inquiry can be appropriate and justifiable. However, the next steps (data preparation, analytics, and inference) are often thorny issues that even the most seasoned researcher must address, and sometimes not so easily. Traditional approaches to data preparation, which include such methods as frequency distribution and contingency table analyses to characterize the data, are themselves open to considerable investigator bias. In addition, there is considerable tedium resulting from applying these methods: for example, how many contingency tables does it take to identify variable interactions? It is arguable that feature selection and construction are two tasks not to be left only to human interpretation. Yet we don’t see much in the way of novel approaches to “experiencing” data such that new, data-driven insights arise during the data preparation process. The same can be said for analysis, where even state-of-the-art statistical methods, informed or driven by pre-formed hypotheses and the results of feature selection processes, sometimes hamper truly novel knowledge discovery. As a result, inferences made from these analyses likewise suffer. However, new approaches to making AI explainable to users, in this case clinical researchers who do not have the time or inclination to develop a deep understanding of how this or that AI algorithm works, are critically important, and their dearth represents a gap that those of us in clinical research informatics need to fill. Yet, the uninitiated shy away from AI for the very lack of explainability.
This talk will explore some new methods for making AI explainable, one of which, PennAI, has been developed at the University of Pennsylvania. PennAI will be demonstrated using several sample datasets.
March 13th: Mario Cannataro, Università degli Studi Magna Graecia di Catanzaro
- Mario Cannataro is a Full Professor of Computer Engineering and Bioinformatics at University “Magna Graecia” of Catanzaro, Italy. He is the director of the Data Analytics research centre and the chair of the Bioinformatics Laboratory at University “Magna Graecia” of Catanzaro. His current research interests include bioinformatics, medical informatics, data analytics, and parallel and distributed computing. He is a Member of the editorial boards of IEEE/ACM Transactions on Computational Biology and Bioinformatics, Briefings in Bioinformatics, High-Throughput, Encyclopaedia of Bioinformatics and Computational Biology, and Encyclopaedia of Systems Biology. He was guest editor of several special issues on bioinformatics and he is serving as a program committee member of several conferences. He has published three books and more than 200 papers in international journals and conference proceedings. Mario Cannataro is a Senior Member of IEEE, ACM and BITS, and a member of the Board of Directors for ACM SIGBio.
Abstract: Recently, several factors are moving biomedical research towards a (big) data-centred science: (i) the Volume of data in bioinformatics is exploding, especially in healthcare and medicine; (ii) new bioinformatics data is created at increasing Velocity due to advances in experimental platforms and increased use of IoT (Internet of Things) health monitoring sensors; (iii) increasing Variety and (iv) Variability of data (omics, clinical, administration, sensors, and social data are inherently heterogeneous) that may lead to wrong modelling, integration and interpretation; and finally (v) increasing Value of data in bioinformatics due to the costs of infrastructures to produce and analyze data, as well as the value of extracted biomedical knowledge. The emergence of this Big Data trend in Bioinformatics poses new challenges for computer science solutions, regarding the efficient storage, preprocessing, integration and analysis of omics (e.g. genomics, proteomics, and interactomics) and clinical (e.g. laboratory data, bioimages, pharmacology data, social network data, etc.) data, resulting in a main bottleneck of the analysis pipeline. To face those challenges, the main trends are: (i) use of high-performance computing in all steps of the analysis pipeline, including parallel processing of raw experimental data, parallel analysis of data, and efficient data visualization; (ii) deployment of data analysis pipelines and main biological databases on the Cloud; (iii) use of novel data models that combine structured (e.g. relational data) and unstructured (e.g. text, multimedia, biosignals, bioimages) data, with special focus on graph databases; (iv) development of novel data analytics methods such as Sentiment Analysis, Affective Computing and Graph Analytics, that integrate traditional statistical and data mining analysis; (v) particular attention to issues regarding privacy of patients, as well as permitted ways to use and analyze biomedical data.
After recalling the main omics data, the first part of the talk presents some experiences and applications related to the preprocessing and data mining analysis of omics, clinical and social data, conducted at University Magna Graecia of Catanzaro. Some case studies in the oncology (pharmacogenomics data) and paediatrics (sentiment analysis) domains are also presented. With the availability of large datasets, Deep Learning algorithms have proved to lead to state-of-the-art performance in many different problems, for example in text classification. However, deep models have the drawback of not being human-interpretable, raising various problems related to model interpretability. Model interpretability is another important aspect to be considered in order to develop a Clinical Decision Support System (CDSS) that clinicians can trust. In particular, an interpretable CDSS can ensure that: i) clinicians understand the system predictions (in the sense that predictions are required to be consistent with medical knowledge); ii) the decisions will not negatively affect the patient; iii) the decisions are ethical; iv) the system is optimized on complete objectives; and v) the system is accurate and sensitive patient data are protected. Therefore there is a need for new strategies for developing explainable AI systems for supporting medical decisions and, in particular, for presenting human-understandable explanations to clinicians that can also take into account sentiment analysis or, more generally, explainable text classification methodologies. Recently, the deep network architecture called Capsule Networks has gained a lot of interest, also showing intrinsic properties that can potentially improve model explainability in image recognition. However, to the best of our knowledge, whether Capsule Networks might improve explainability for text classification problems is a point that needs to be further investigated.
The second part of this talk will focus on a brief overview of proposed explainable models and then will present some discussion related to how Capsule Networks can be adapted to sentiment classification problems in order to improve explainability.
January 16th: Norman Fenton, Queen Mary, University of London
Pearse A. Keane, MD, FRCOphth, is a consultant ophthalmologist at Moorfields Eye Hospital, London and an NIHR Clinician Scientist, based at the Institute of Ophthalmology, University College London (UCL). Dr Keane specialises in applied ophthalmic research, with a particular interest in retinal imaging and new technologies. In April 2015, he was ranked no. 4 on a worldwide ranking of ophthalmologists under 40, published in “the Ophthalmologist” journal (https://theophthalmologist.com/the-power-list-2015/). In 2016, he initiated a formal collaboration between Moorfields Eye Hospital and Google DeepMind, with the aim of applying machine learning to automated diagnosis of optical coherence tomography (OCT) images. In August 2018, the first results of this collaboration were published in the journal Nature Medicine.
The Moorfields-DeepMind Collaboration – Reinventing the Eye Examination
Ophthalmology is among the most technology-driven of all the medical specialties, with treatments utilizing high-spec medical lasers and advanced microsurgical techniques, and diagnostics involving ultra-high resolution imaging. Ophthalmology is also at the forefront of many trailblazing research areas in healthcare, such as stem cell therapy, gene therapy, and – most recently – artificial intelligence. In July 2016, Moorfields announced a formal collaboration with the world’s leading artificial intelligence company, DeepMind. This collaboration involves the sharing of >1,000,000 anonymised retinal scans with DeepMind to allow for the automated diagnosis of diseases such as age-related macular degeneration (AMD) and diabetic retinopathy (DR).
In my presentation, I will describe the motivation – and urgent need – to apply deep learning to ophthalmology, the processes required to establish a research collaboration between the NHS and a company like DeepMind, the initial results of our research, and finally, why I believe that ophthalmology could be the first branch of medicine to be fundamentally reinvented through the application of artificial intelligence.
Norman Fenton is Professor of Risk Information Management at Queen Mary University of London and is also a Director of Agena, a company that specialises in risk management for critical systems. Norman is a mathematician by training whose current research focuses on critical decision-making and, in particular, on quantifying uncertainty using a ‘smart data’ approach that combines data with expert judgment. Applications include law and forensics (Norman has been an expert witness in major criminal and civil cases), health, security, software reliability, transport safety and reliability, finance, and football prediction. Norman has been PI in grants totalling over £10million. He currently leads an EPSRC Digital Health Technologies Project (PAMBAYESIAN) and a Leverhulme Trust grant (CAUSAL-DYNAMICS). In 2014 Norman was awarded a prestigious European Research Council Advanced Grant (BAYES-KNOWLEDGE) in which the ‘smart data’ approach evolved. Since June 2011 he has led an international consortium (Bayes and the Law) of statisticians, lawyers and forensic scientists working to improve the use of statistics in court. In 2016 he led a prestigious 6-month Programme on Probability and Statistics in Forensic Science at the Isaac Newton Institute for Mathematical Sciences, University of Cambridge where he was also a Simons Fellow. He was appointed as a Fellow of The Turing Institute in 2018. In March 2015 Norman presented the award-winning BBC documentary Climate Change by Numbers.
Abstract: Misunderstandings about risk, statistics and probability often lead to flawed decision-making in many critical areas such as medicine, finance, law, defence, and transport. The ‘big data’ revolution was intended to at least partly address these concerns by removing reliance on subjective judgments. However, even where (relevant) big data are available there are fundamental limitations to what can be achieved through pure machine learning techniques. This talk will explain the successes and challenges in using causal probabilistic models of risk – based on a technique called Bayesian networks – in providing powerful decision-support and accurate predictions by a ‘smart data’ approach. This combines minimal data with expert judgment. The talk will provide examples in chronic diseases, forensics, terrorist threat analysis, and even sports betting.
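As a rough illustration of the kind of reasoning a Bayesian network encodes, the sketch below computes a posterior in the smallest possible network, a single Disease -> Test edge. It is not taken from the talk: the function name and the prevalence, sensitivity and false-positive figures are hypothetical values chosen only to show how a positive result from a seemingly accurate test can still leave the probability of disease low.

```python
def posterior_disease_given_positive(prior, sensitivity, false_positive_rate):
    """Posterior P(disease | positive test) in a two-node network Disease -> Test."""
    # Joint probability of each disease state together with a positive test.
    p_pos_and_disease = prior * sensitivity
    p_pos_and_healthy = (1.0 - prior) * false_positive_rate
    # Bayes' theorem: normalise by the total probability of a positive test.
    return p_pos_and_disease / (p_pos_and_disease + p_pos_and_healthy)


# Hypothetical figures: 1% prevalence, 95% sensitivity, 5% false-positive rate.
posterior = posterior_disease_given_positive(
    prior=0.01, sensitivity=0.95, false_positive_rate=0.05
)
print(f"P(disease | positive test) = {posterior:.3f}")  # prints 0.161
```

With these assumed numbers only about 16% of positive results correspond to true disease, because the low prior dominates. Larger networks chain the same calculation across many variables.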
- December 12th 2018: Pearse Keane, Moorfields Eye Hospital “Artificial Intelligence in Ophthalmology“.
Niels Peek is Professor of Health Informatics and Strategic Research Domain Director for Digital Health at the University of Manchester. He has a background in Computer Science and Artificial Intelligence, and his research focuses on data-driven methods for health research, healthcare quality improvement, and computerised decision support. From 2013 to 2017 he was the President of the Society for Artificial Intelligence in Medicine (AIME). He is a member of the editorial boards of the Journal of the American Medical Informatics Association and the Artificial Intelligence in Medicine journal. In April 2017, he organised the Informatics for Health 2017 conference in Manchester which was attended by more than 800 people from 30 countries. He also co-chaired the Scientific Programme Committee of MEDINFO-2017, the 16th World Congress on Health and Biomedical Informatics, which was held in Hangzhou, China, in August 2017. In 2018 he was elected a fellow of the American College of Medical Informatics and a fellow of the Alan Turing Institute.
My talk will introduce the concept of “Learning Health Systems” and focus on the role of clinical prediction models within these systems. Building on the distinction between explanatory and predictive models (which is commonly made in statistics and epidemiology but not in computer science) I will review the use of machine learning and statistical modelling in healthcare; discuss the role of model interpretation and transparency in explanatory and predictive models; and discuss the suitability of different analytical methods to facilitate interpretability and transparency.
- October 17th: Allan Tucker, “Opening the Black Box“, Brunel University London
- July 2018: “Temporal Information Extraction from Clinical Narratives”
Natalia Viani, King’s College London
Electronic health records represent a great source of valuable information for both patient care and biomedical research. Despite the efforts put into collecting structured data, a lot of information is available only in the form of free-text. For this reason, developing natural language processing (NLP) systems that identify clinically relevant concepts (e.g., symptoms, medication) is essential. Moreover, contextualizing these concepts from the temporal point of view represents an important step.
Over the past years, many NLP systems have been developed to process clinical texts written in English and belonging to specific medical domains (e.g., intensive care unit, oncology). However, research for multiple languages and domains is still limited. Through my PhD years, I applied information extraction techniques to the analysis of medical reports written in Italian, with a focus on the cardiology domain. In particular, I explored different methods for extracting clinical events and their attributes, as well as temporal expressions.
At the moment, I am working on the analysis of mental health records for patients with a diagnosis of schizophrenia, with the aim to automatically identify symptom onset information starting from clinical notes.
Dr Viani is a postdoctoral research associate at the Department of Psychological Medicine, NIHR Biomedical Research Centre, Institute of Psychiatry, Psychology and Neuroscience, King’s College London. She received her PhD in Bioengineering and Bioinformatics from the Department of Electrical, Computer and Biomedical Engineering, University of Pavia, in January 2018. During her PhD, she spent six months as a visiting research scholar in the Natural Language Processing Laboratory at the Computational Health Informatics Program at Boston Children’s Hospital – Harvard Medical School.
Her research interests are natural language processing, clinical and temporal information extraction, and biomedical informatics. She is especially interested in the reconstruction of clinical timelines starting from free-text.
- March 2018: “Optimal Low-dimensional Projections for Spectral Clustering”
Nicos Pavlidis, Lancaster University
The presentation will discuss the problem of determining the optimal low dimensional projection for maximising the separability of a binary partition of an unlabelled dataset, as measured by spectral graph theory. This is achieved by finding projections which minimise the second eigenvalue of the graph Laplacian of the projected data, which corresponds to a non-convex, non-smooth optimisation problem. It can be shown that the optimal univariate projection based on spectral connectivity converges to the vector normal to the maximum margin hyperplane through the data, as the scaling parameter is reduced to zero. This establishes a connection between connectivity as measured by spectral graph theory and maximal Euclidean separation.
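The objective described in this abstract can be sketched for a single candidate direction as follows. This is a minimal illustration, not the speaker's method: it assumes a Gaussian similarity with scaling parameter `sigma` and the unnormalised Laplacian, and it only evaluates the second eigenvalue for a given projection rather than optimising over projections. The function name and toy data are hypothetical, and a deflated power iteration is used purely to keep the example dependency-free; in practice a library eigensolver would be the usual choice.

```python
import math


def second_laplacian_eigenvalue(points, direction, sigma=1.0, iters=2000):
    """Estimate the second-smallest eigenvalue of the graph Laplacian built
    from the 1-D projection of `points` onto `direction`."""
    # Normalise the direction and project every point onto it.
    norm = math.sqrt(sum(d * d for d in direction))
    proj = [sum(a * b for a, b in zip(p, direction)) / norm for p in points]
    n = len(proj)

    # Gaussian similarity between projected points (an assumed kernel choice).
    w = [[0.0 if i == j
          else math.exp(-((proj[i] - proj[j]) ** 2) / (2.0 * sigma ** 2))
          for j in range(n)] for i in range(n)]
    deg = [sum(row) for row in w]

    # The unnormalised Laplacian L = D - W has eigenvalues in [0, 2 * max degree].
    shift = 2.0 * max(deg)

    def shifted_matvec(x):
        # Computes (shift * I - L) x. Once the constant vector is removed, the
        # dominant eigenvector of this matrix is the Fiedler vector of L.
        return [shift * x[i] - (deg[i] * x[i] - sum(w[i][j] * x[j] for j in range(n)))
                for i in range(n)]

    def deflate_and_normalise(x):
        # Remove the component along the constant vector, then rescale to unit length.
        mean = sum(x) / n
        x = [v - mean for v in x]
        length = math.sqrt(sum(v * v for v in x))
        return [v / length for v in x]

    # Deterministic start; assumed not to be orthogonal to the Fiedler vector.
    x = deflate_and_normalise([float(i + 1) for i in range(n)])
    for _ in range(iters):
        x = deflate_and_normalise(shifted_matvec(x))

    # Rayleigh quotient of the shifted matrix, mapped back to an eigenvalue of L.
    y = shifted_matvec(x)
    return shift - sum(a * b for a, b in zip(x, y))


# Toy data: two groups separated along the first coordinate only.
points = [(0, 0), (0, 1), (0, 2), (5, 0), (5, 1), (5, 2)]
lam_separating = second_laplacian_eigenvalue(points, (1, 0))
lam_mixing = second_laplacian_eigenvalue(points, (0, 1))
print(lam_separating < lam_mixing)  # prints True
```

On this toy data the projection onto the first coordinate keeps the two groups apart and gives a second eigenvalue close to zero, while the projection onto the second coordinate mixes the groups and gives a much larger value. Searching over directions for the minimum is the non-convex, non-smooth problem the abstract refers to.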
- December 2016: “The value of evaluation: towards trustworthy machine learning”
Peter Flach, University of Bristol
Machine learning, broadly defined as data-driven technology to enhance human decision making, is already in widespread use and will soon be ubiquitous and indispensable in all areas of human endeavour. Data is collected routinely in all areas of significant societal relevance including law, policy, national security, education and healthcare, and machine learning informs decision making by detecting patterns in the data. Achieving transparency, robustness and trustworthiness of these machine learning applications is hence of paramount importance, and evaluation procedures and metrics play a key role in this.
In this talk I will review current issues in theory and practice of evaluating predictive machine learning models. Many issues arise from a limited appreciation of the importance of the scale on which metrics are expressed. I will discuss why it is OK to use the arithmetic average for aggregating accuracies achieved over different test sets but not for aggregating F-scores. I will also discuss why it is OK to use logistic scaling to calibrate the scores of a support vector machine but not to calibrate naive Bayes. More generally, I will discuss the need for a dedicated measurement theory for machine learning that would use latent-variable models such as item-response theory from psychometrics in order to estimate latent skills and capabilities from observable traits.
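One way to see the difference between averaging accuracies and averaging F-scores is a small numerical check. This is only an illustration under assumed numbers, not necessarily the argument made in the talk: the two confusion matrices below are hypothetical test sets of equal size, and the comparison is against the metric computed on the pooled counts.

```python
def accuracy(tp, fp, fn, tn):
    # Fraction of all predictions that are correct.
    return (tp + tn) / (tp + fp + fn + tn)


def f1(tp, fp, fn, tn):
    # Harmonic mean of precision and recall, written in terms of counts.
    return 2 * tp / (2 * tp + fp + fn)


# Two hypothetical test sets with 100 examples each.
set_a = dict(tp=40, fp=10, fn=10, tn=40)
set_b = dict(tp=5, fp=5, fn=15, tn=75)
pooled = {key: set_a[key] + set_b[key] for key in set_a}

mean_acc = (accuracy(**set_a) + accuracy(**set_b)) / 2
mean_f1 = (f1(**set_a) + f1(**set_b)) / 2

print(f"accuracy: mean {mean_acc:.3f}, pooled {accuracy(**pooled):.3f}")
print(f"F1:       mean {mean_f1:.3f}, pooled {f1(**pooled):.3f}")
```

For equal-sized test sets the mean accuracy matches the pooled accuracy (0.800 in both cases), whereas the mean F1 (0.567) differs from the pooled F1 (0.692), because F1 is a non-linear function of the counts.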
- October 2016: “On Models, Patterns, and Prediction”
Jaakko Hollmén, Aalto University, Helsinki
Pattern discovery has been the center of attention of data mining research for a long time, with pattern languages varying from simple to complex, according to the needs of the applications and the format of data. In this talk, I will take a view on pattern mining that combines elements from neighboring areas. More specifically, I will describe our previous research work in the intersection of the three areas: probabilistic modeling, pattern mining and predictive modeling. Clustering in the context of pattern mining will be explored, as well as linguistic summarization patterns. Also, multiresolution pattern mining as well as semantic pattern discovery and pattern visualization will be visited. Time allowing, I will speak about patterns of missing data and their implications for predictive modeling.
Jaakko Hollmén is a faculty member at the Department of Computer Science at Aalto University in Espoo, Finland. He received his doctoral degree with distinction in 2000. His research interests include data analysis, machine learning and data mining, with applications in health and in environmental informatics. He has chaired several conferences in his areas of interest, including IDA, DS, IEEE Computer-Based Medical Systems. Currently, he is co-chair of the Program Committee of ECML PKDD 2017, which is organized in Skopje, Macedonia during September 19-23, 2017. His publications can be found at: https://users.ics.aalto.fi/jhollmen/Publications/
- May 2016: “Beyond Clinical Data Mining: Electronic Phenotyping for Research Cohort Identification”
John Holmes, University of Pennsylvania
The availability of ever-increasing amounts of highly heterogeneous clinical data poses both opportunities and challenges for the data scientist and clinical researcher. Electronic medical records are more prevalent than ever, and now we see that other data sources contribute greatly to the clinical research enterprise. These sources provide genetic, image, and environmental data, just to name three. Now, it is possible to investigate the effects of built environment, such as the availability of food markets, sidewalks, and playgrounds, coupled with clinical observations noted in the process of providing patient care, along with identified genetic variants that could predispose one to diabetes mellitus. Furthermore, these data could be used in a truly integrated sense to manage such patients more effectively than relying solely on the traditional medical record. The opportunity for enhanced clinical research is manifest in this expanding data and information ecosystem. The challenges are more subtly detected, but present nonetheless. Merging these heterogeneous data into an analyzable whole depends on the availability of a robust unique identifier that has yet to be created, at least in the US. As a result, researchers have developed various probabilistic methods of record matching, occasionally at the expense of data privacy and confidentiality. Another challenge is the sheer heterogeneity of the data; it is not easy to understand the clinical context of an image or waveform without their semantic integration with clinical observation data. In addition, there is the problem of ecologic fallacy, which arises from using data that have no real connection to a clinical record in the service of proposing or testing hypotheses.
This problem is quite evident when coupling environmental and clinical data: just because there is a well-stocked market with a surfeit of inexpensive, healthy food options in a person’s neighborhood doesn’t mean that that person avails herself of these items. Finally, there is the problem of data quality. Much of the data we use, whether collected by us or obtained from another source, is replete with problems, such as missingness, contradictions, and errors in representation. We will explore in detail the opportunities and challenges posed to informatics and clinical researchers as they are faced with these seemingly endless sources of data. We will also discuss novel approaches to mining these complex, heterogeneous data for the purpose of constructing cohorts for research.
John Holmes is Professor of Medical Informatics in Epidemiology at the University of Pennsylvania Perelman School of Medicine. He is the Associate Director of the Penn Institute for Biomedical Informatics and is Chair of the Graduate Group in Epidemiology and Biostatistics. Dr. Holmes’ research interests are focused on several areas in medical informatics, including evolutionary computation and machine learning approaches to knowledge discovery in clinical databases (data mining), interoperable information systems infrastructures for epidemiologic surveillance, regulatory science as it applies to health information and information systems, clinical decision support systems, semantic analysis, shared decision making and patient-physician communication, and information systems user behavior. Dr. Holmes is a principal or co-investigator on projects funded by the National Cancer Institute, the Patient-Centered Outcomes Research Institute, the National Library of Medicine, and the Agency for Healthcare Research and Quality, and he is the principal investigator of the NIH-funded Penn Center of Excellence in Prostate Cancer Disparities. Dr. Holmes is engaged with the Botswana-UPenn Partnership, assisting in building informatics education and clinical research capacity in Botswana. He leads the evaluation of the National Obesity Observational Studies of the Patient-Centered Clinical Research Network. Dr. Holmes is an elected Fellow of the American College of Medical Informatics and the American College of Epidemiology.