Data Adequacy Bias Impact in a Data-blinded Semi-supervised GAN for Privacy-aware COVID-19 Chest X-Ray Classification

ACM-BCB 2022 - The 13th ACM Conference on Bioinformatics, Computational Biology and Health Informatics, Chicago, IL, 2022

Abstract: Supervised machine learning models are, by definition, data-sighted: they require a view of all or most of the labeled training dataset. This paradigm presents two intertwined bottlenecks: the risk of exposing sensitive data samples to the third-party site where the machine learning engineers work, and the time-consuming, laborious, and bias-prone nature of data annotation by personnel at the data source site. In this paper we studied the learning impact of data adequacy as a source of bias in a data-blinded semi-supervised learning model for COVID-19 chest X-ray classification. Data-blindness was implemented on a semi-supervised generative adversarial network that generates synthetic data based on only a few labeled data samples and concurrently learns to classify targets. We designed and developed a data-blind COVID-19 patient classifier that determines whether an individual is suffering from COVID-19 or another type of illness, with the ultimate goal of producing a system to assist in labeling large datasets. However, the availability of labels in the training data affected model performance, and when a new disease spreads, as COVID-19 did in 2019, access to labeled data may be limited. Here, we studied how bias in the per-class distribution of labeled samples affected classification performance for three models: a Convolutional Neural Network based classifier (CNN), a semi-supervised GAN using the source data (SGAN), and our proposed data-blinded semi-supervised GAN (BSGAN). Data-blindness prevents machine learning engineers from directly accessing the source data during training, thereby ensuring data confidentiality. This was achieved by using synthetic data samples, generated by a separate generative model, to train the proposed model. Our model achieved comparable performance, with a trade-off of 0.05 AUC score between the privacy-aware model and a traditionally learnt model, and it remained stable, showing the same learning performance as the data distribution was changed.

Authors: Javier Pastorino, Ashis Kumer Biswas

Paper Link - Presentation - Source Code (Github)

Determination of Optimal Set of Spatio-temporal Features for Predicting Burn Probability in the State of California, USA

ACMSE 2022 - ACM Southeast Conference, Virtual Event, 2022

Abstract: Wildfires play a critical role in determining ecosystem structure and function and pose serious risks to human life, property and ecosystem services. Burn probability (BP) models the likelihood that a location could burn. Simulation models are typically used to predict BP but are computationally intensive. Machine learning (ML) pipelines can predict BP and reduce computational intensity. In this work, we tested approaches to reduce the set of input features used in an ML model to estimate BP for the state of California, USA, without loss of predictive performance. We used Principal Component Analysis (PCA) to determine the optimal set of features to use in our ML pipeline. Then, we mapped BP and compared model performance when using the reduced set and when using the whole set of features. Models using optimized input achieved similar prediction performance while using less than 50% of the input features.

Authors: Javier Pastorino, Joseph W. Director, Ashis Kumer Biswas, Todd J. Hawbaker

Paper Link - Presentation - Source Code (Github)

Data-Blind ML: Building Privacy-aware Machine Learning Models Without Direct Data Access

IEEE AIKE 2021 - IEEE International Conference on Artificial Intelligence and Knowledge Engineering (AIKE), Virtual Event, 2021

Abstract: Traditional Machine Learning (ML) pipeline development requires the ML practitioner to directly access the data to analyze, clean, and preprocess it in order to develop an ML model, train it, and evaluate its performance. When the data owner has no infrastructure for in-house development, such pipelines are outsourced. Data commonly carries some level of privacy constraints, which impose laborious and possibly expensive requirements, including contract drafting and infrastructure improvement, among others. Traditional approaches rely either on anonymization, which does not entirely protect against identity disclosure, or on synthetic data generation, which requires expertise not necessarily available to the organization. In this paper, we present Data-Blind ML, an automated framework, fueled by synthetic generative learning and distributed computing paradigms, which enables an organization to outsource the development and training of ML models without sharing any sample from the real dataset. In addition, the framework allows the ML practitioner to get feedback on the model’s performance against the actual real data without accessing it directly.

Authors: Javier Pastorino, Ashis Kumer Biswas

Paper Link - Presentation - Source Code (Github)

Hey ML, What Can You Do for Me?

IEEE AIKE 2020 - IEEE International Conference on Artificial Intelligence and Knowledge Engineering (AIKE), Virtual Event, 2020

Abstract: Machine learning (ML) algorithms are data-driven: given a goal task and a prior experience dataset relevant to the task, one can attempt to solve the task using ML, seeking to achieve high accuracy. There is usually a big gap in understanding between ML experts and dataset providers due to limited cross-disciplinary expertise. Narrowing down a suitable set of problems to solve using ML is possibly the most ambiguous yet important agenda for data providers to consider before initiating collaborations with ML experts. We proposed an ML-fueled pipeline to identify potential problems (i.e., the tasks) so data providers can, with ease, explore potential problem areas to investigate with ML. The autonomous pipeline integrates information theory and graph-based unsupervised learning paradigms in order to generate a ranked retrieval of top-$k$ problems for the given dataset for a successful ML-based collaboration. We conducted experiments on diverse real-world and well-known datasets, and from a supervised learning standpoint, the proposed pipeline achieved $72\%$ top-$5$ task retrieval accuracy on average, which surpasses the retrieval performance for the same paradigm using popular exploratory data analysis tools. Detailed experiment results with our source code are available at our GitHub.

Authors: Javier Pastorino, Ashis Kumer Biswas

Paper Link - Presentation - Source Code (Github)

TexAnASD: Text Analytics for ASD Risk Gene Predictions

IEEE BIBM 2019 - IEEE International Conference on Bioinformatics and Biomedicine (BIBM), San Diego, CA, USA, 2019, pp. 1350-1357

Abstract: Autism Spectrum Disorder (ASD) is an extreme neurodevelopmental disease affecting 1 in every 59 children in the United States, and approximately 1% of the US population. The clinical traits of the disorder include noticeable deficits in social interactions and language development, and in many cases very narrow and repetitive interests and behaviors. ASD is a highly heritable genetic disease, but the known causes, including the biomarkers behind it, form only the tip of the iceberg. Over the past decade, extensive research on exome sequences revealed only around one hundred genes causing it with very high confidence. The number of putative ASD-causing genes is growing rapidly with the advent of new technologies, while researchers are now struggling to assess which genes are truly causal. Manual curation of each gene in the long list is a cumbersome process that requires a huge amount of expert work-hours and is expensive. An in silico prediction method can assist human experts by letting them check only a short list of genes filtered by a machine learning system. Most existing prediction algorithms involve a high-performance computing platform to analyze large-scale genetic data, which is counter-intuitive to the actual benefit of using an in silico method in the first place. We proposed TexAnASD, a text analytics based ASD gene prediction algorithm that utilizes only what is known about each gene from the published literature. The proposed method outperforms most of the state-of-the-art prediction systems. Moreover, the method builds the least complex model of all of them.

Authors: Javier Pastorino, Ashis Kumer Biswas

Paper link


Methodological Guide for Accessible Virtual Curriculum Developments Implementation


Universidad de Alcalá. April 2013

Abstract: This methodological guide for implementing accessible virtual curriculum developments was developed as part of the ESVI-AL project. The guide is designed as a support tool for everyone involved in accessible virtual educational projects, primarily teachers, but also management, administrative, and technical staff at institutions seeking to implement inclusive virtual training activities in which all students can participate on equal terms.

Authors: José Ramón Hilera, Regina Motz, Javier Pastorino,

ISBN: 978-84-15834-07-6

Transformations between Temporal Evolution Models

CACIC 2002

Abstract: Temporal databases store how information evolves over time. Such evolution may be classified as schema evolution or extension evolution. This allows temporal information systems to be classified into four different types according to their capabilities for handling the temporal evolution dimensions. The key goal of the present study is to define a model that allows information stored in such databases to be shared, based on methodologies for converting data between the models. We conclude that this kind of transformation is possible without losing the semantics of the information or the evolution registered in the source system, by transforming any other model into a Bi-Temporal Evolution System and using it as an equivalent of the original model.

Authors: Javier Pastorino, Regina Motz

Available in Spanish only: