Machine Learning and Pattern Recognition

AI3011 | Spring 2024



Abnormality Detection Using Heartbeat Sounds

Angad Singh, Malhar Bhise, Suhani Jain, Yuvna Jain

CODE PDF

The World Heart Federation (WHF) estimated that 20.5 million people died from cardiovascular diseases in 2023, representing 34% of all global deaths; 72.1% of those affected were unaware of their hypertension status. Heart sound analysis is a non-invasive technique that can facilitate early identification of cardiac abnormalities such as murmurs.

Our project aims to employ machine learning for the analysis of heart sounds collected via devices such as a digital stethoscope, to facilitate the early identification of cardiac abnormalities and murmurs. We aim to expand this feature to smartphones, where people can easily check their heart health and determine when medical advice is necessary.

We used the publicly available PhysioNet 2022 challenge dataset, which contains 5,000+ heart sound recordings labelled for murmur presence and normality. Previous studies, such as that of McDonald et al. (2022), the winners of the challenge, use various classification techniques including CNNs and RNNs. None, however, apply these classifiers to mobile-phone heart recordings or modify their models to account for them.

After preprocessing the data by clipping, downsampling, and filtering with a Butterworth filter, we used the extracted MFCC images of the sounds as input to our model. After trying both a CNN and an RNN (LSTM) for classification, we observed better performance with the CNN. This model accounted for the temporal variation in the sound and was robust enough to classify the lower-quality mobile recordings.
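
As a rough sketch of this preprocessing chain (the 2 kHz sample rate, 25-400 Hz pass band, 10-second clip, and 13 MFCCs are illustrative assumptions, not necessarily the project's settings):

```python
# Hedged sketch: clip, downsample, Butterworth band-pass, then MFCCs.
# All rates, cutoffs, and durations below are assumptions for illustration.
import librosa
from scipy.signal import butter, sosfilt

def preprocess_heart_sound(path, sr=2000, low_hz=25, high_hz=400):
    y, _ = librosa.load(path, sr=sr)                  # load + downsample
    y = y[: sr * 10]                                  # clip to 10 seconds
    sos = butter(4, [low_hz, high_hz], btype="bandpass", fs=sr, output="sos")
    y = sosfilt(sos, y)                               # Butterworth filtering
    return librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)  # MFCC "image" for the CNN
```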

Cost and weighted accuracy, performance metrics defined by the dataset creators at PhysioNet, were used to evaluate our model; they were designed primarily to penalize false negatives (non-diagnosis of CVD). Our model would rank first among all challenge participants, with a low cost of 8388 and a weighted accuracy of 0.716.


Accent Recognition

Mukundan Gurumurthy, Parth Ghule, Srirangadeep Yarlagatta

CODE PDF

Speech recognition, dialect identification, and language learning are among the many applications of accent classification. Previous studies that classified accents using acoustic features and machine learning models investigated only a few accents or could not handle different accents efficiently. In addition, we also classify gender and age from the audio, which has not been done before. This project aims to build a reliable, precise, and widely usable system for classifying multiple accents by combining various feature extraction techniques with multiple machine learning algorithms and an ensemble voting classifier.

Our approach is based on Common Voice, a dataset created by Mozilla containing diverse speech in different accents. The first step preprocesses the raw audio to extract Mel-frequency cepstral coefficients (MFCCs), which represent robust acoustic characteristics. These MFCC features serve as inputs to four machine learning methods: k-nearest neighbors (KNN) (85% accuracy across 7 classes), support vector machines (SVM) (68% accuracy across 10 classes), a neural network (NN) (65% accuracy across 10 classes), and convolutional neural networks (CNN). Finally, the individual models' predictions are combined through an ensemble voting classifier, which increases classification accuracy.
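
A minimal sketch of that final ensemble stage with scikit-learn (the estimators' hyperparameters and the toy data are placeholders, not the project's tuned settings):

```python
# Soft-voting ensemble over KNN, SVM, and a small neural network,
# assuming MFCC features have been flattened into a feature matrix.
import numpy as np
from sklearn.ensemble import VotingClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.svm import SVC

X = np.random.rand(200, 39)                  # placeholder MFCC features
y = np.random.randint(0, 7, 200)             # placeholder accent labels

ensemble = VotingClassifier(
    estimators=[("knn", KNeighborsClassifier(n_neighbors=5)),
                ("svm", SVC(probability=True)),
                ("nn", MLPClassifier(hidden_layer_sizes=(128,), max_iter=500))],
    voting="soft",                           # average class probabilities
)
ensemble.fit(X, y)
print(ensemble.predict(X[:3]))
```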


AdaptiveSpeech: Dynamic Speech Rate Adjustment for Effective Communication

Manavpreet Singh Cheema, Mudit Surana, Vaisakh Menon

CODE PDF

Advanced speech-rate adjustment mechanisms are crucial for individuals with speech impairments such as dysarthria, which affects approximately 23.47 million people globally. Previous studies, such as those leveraging the CREMA-D dataset, primarily focused on basic emotion recognition and speech analysis. Our project aims to dynamically adjust speech rates and provide instant feedback, differentiating it from prior approaches that did not offer real-time modification capabilities.

We sourced audio data from the CREMA-D dataset, which includes diverse emotional expressions across different demographics. For preprocessing, we utilized segmentation, normalization, and silence trimming techniques to enhance data quality. Our model incorporates a combination of Random Forest for emotion classification and LSTM networks for capturing temporal speech dynamics, chosen for their effectiveness in handling sequence data and robustness against overfitting.

The model demonstrated moderate success with an overall accuracy of 54% for speech speed classification and a lower 44% for emotion detection, as reflected by F1 scores of 0.53 and 0.34, respectively. The performance indicates a promising direction in speech rate adjustment technology, particularly in improving accessibility for individuals requiring speech therapy, thereby enhancing their communication efficacy in daily interactions. Further optimizations and extended dataset trials are planned to refine the accuracy and applicability of the solution.


Basketball Offense Play Optimization

Dhruv Srivastava, Yash Sangtani

CODE PDF

In professional basketball, leveraging advanced data analytics is pivotal for enhancing strategic decision-making and optimizing team performance. With the advent of player tracking technologies like SportVU, rich datasets are now available, providing detailed insights into game dynamics. Our project aims to utilize this data not just for post-game analysis but to augment offensive strategies in real time, offering coaches and players actionable insights that adapt to live gameplay conditions. Unlike prior studies focused solely on post-game analytics, our project integrates real-time data analytics to dynamically recommend strategic adjustments during the game, empowering teams to adapt on the fly.

Our data originates from the NBA's SportVU player tracking system, enhanced with play-by-play descriptions to enrich the quantitative data with contextual insights, allowing a comprehensive understanding of game events. We began with logistic regression to establish baseline performance and then implemented sophisticated neural network models to capture the complex spatial-temporal patterns of player movements.

For pass difficulty, we employed a neural network model with a dense layer containing 128 units along with a dropout layer of 0.2 and achieved an overall accuracy of 1.0 and an area under the ROC curve (AUC) of 1.0. The model’s F1 score was 1.0, indicating robust performance in predicting pass completion probabilities. For shot difficulty, we transitioned to a deeper network with four fully connected layers, each comprising 512 units activated by leaky ReLU functions. This model improved prediction accuracy with an AUC of 0.92 and an F1 score of 0.82.
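
For concreteness, the pass-difficulty network as described (one 128-unit dense layer with 0.2 dropout) could be sketched as follows; the input dimension and compile settings are our assumptions:

```python
# Hedged Keras sketch of the pass-difficulty classifier described above.
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(16,)),           # assumed tracking features
    tf.keras.layers.Dense(128, activation="relu"),
    tf.keras.layers.Dropout(0.2),
    tf.keras.layers.Dense(1, activation="sigmoid"),  # P(pass completed)
])
model.compile(optimizer="adam", loss="binary_crossentropy",
              metrics=["accuracy", tf.keras.metrics.AUC()])
```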

Additionally, we have integrated zonal field goal percentage calculations into our analysis. Players are predicted to be in specific zones based on three categories: "shot zone basic," "shot zone area," and "shot zone range." Using these categorizations, we calculate the field goal percentage (FGP) for each zone, providing tailored insights that help in fine-tuning shooting strategies during the game.


CanDTI: Drug Target Interaction for Non-small Cell Lung Cancer

Aman Paliwal, Arbaaz Shafiq, Arman Ghosh

CODE PDF

Machine learning can accelerate the drug discovery process: it enables safe computational screening and evaluation of drug-protein interactions. Initial studies focused on binary classification of interactions as plausible or not. Our project strengthens this approach by predicting a continuous binding-affinity value. In addition, we focus on a specific disease and its associated targets, namely non-small cell lung cancer.

We use the KIBA dataset, consisting of drug SMILES sequences, protein sequences, and a KIBA binding-affinity score. After basic preprocessing such as standardising sequence lengths and one-hot encoding, we use two separate CNN blocks to encode the sequences. The initial convolution layers are followed by a max-pooling layer. The encoded vectors are concatenated and passed through multiple fully connected layers and, finally, an output layer that produces a score. To enhance feature learning and embeddings, we train the model on the entire KIBA dataset and then fine-tune it on non-small cell lung cancer targets.
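
A hedged sketch of this two-branch architecture (sequence lengths, vocabulary sizes, and layer widths below are illustrative assumptions):

```python
# Two CNN encoders (drug SMILES and protein), concatenated, then dense
# layers regressing a continuous KIBA affinity score.
import tensorflow as tf
from tensorflow.keras import layers

def encoder(seq_len, vocab, filters=32):
    inp = layers.Input(shape=(seq_len, vocab))       # one-hot sequence
    x = layers.Conv1D(filters, 8, activation="relu")(inp)
    x = layers.GlobalMaxPooling1D()(x)               # max-pooling stage
    return inp, x

smiles_in, smiles_vec = encoder(100, 64)             # assumed sizes
protein_in, protein_vec = encoder(1000, 25)

h = layers.Concatenate()([smiles_vec, protein_vec])
h = layers.Dense(512, activation="relu")(h)
h = layers.Dense(256, activation="relu")(h)
score = layers.Dense(1)(h)                           # binding affinity

model = tf.keras.Model([smiles_in, protein_in], score)
model.compile(optimizer="adam", loss="mse")
```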

We achieve an MSE of 0.137, matching the current state-of-the-art ensemble model for general DTI. Our proposed deep learning approach introduces disease targeting into the DTI process and achieves this accuracy without resorting to ensemble methods or heavy computation.


CNN-LSTM Malware Detection Model

Agaaz Singhal, Ayush Sharma, Shikhar Beriwal

CODE PDF

In response to the growing prevalence of malware in Windows PE files and the limitations of traditional signature-based detection methods, this project develops a novel malware detection model utilizing a combination of Convolutional Neural Networks (CNN) and Long Short-Term Memory (LSTM) networks. The model leverages a dataset converted into greyscale image formats of PE files, allowing for efficient input into the neural architecture, which includes multiple layers with specific settings such as 48, 32, and 62 filters in the convolutional layers and 96 units in the LSTM layer. By integrating CNN for robust feature extraction and LSTM for capturing temporal dependencies, the model achieves high precision (0.99) and recall (0.99), significantly outperforming traditional methods in accuracy and adaptability to new threats. This approach not only enhances malware detection capabilities but also addresses the critical need for systems that can adapt to evolving malware signatures, offering a substantial improvement over existing technologies.
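
A hedged sketch of such a CNN-LSTM (the 64x64 input size, pooling, and reshape are our assumptions; the filter counts and LSTM width follow the description):

```python
# Convolutional feature extraction over greyscale PE-file images, then an
# LSTM over the resulting row sequence; binary malware/benign output.
import tensorflow as tf
from tensorflow.keras import layers

model = tf.keras.Sequential([
    layers.Input(shape=(64, 64, 1)),               # assumed 64x64 greyscale
    layers.Conv2D(48, 3, activation="relu", padding="same"),
    layers.MaxPooling2D(),
    layers.Conv2D(32, 3, activation="relu", padding="same"),
    layers.MaxPooling2D(),
    layers.Conv2D(62, 3, activation="relu", padding="same"),
    layers.Reshape((16, 16 * 62)),                 # rows become time steps
    layers.LSTM(96),
    layers.Dense(1, activation="sigmoid"),         # malware vs. benign
])
model.compile(optimizer="adam", loss="binary_crossentropy",
              metrics=[tf.keras.metrics.Precision(),
                       tf.keras.metrics.Recall()])
```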


Correcting Cognitive Biases in Doctor-patient Interactions

Dhruv Menon

CODE PDF

Fatigued physicians often resort to heuristics under time pressure, which can lead to incorrect Bayesian updates.
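
For reference, the update in question is standard Bayes' rule for a diagnostic finding E and disease D (a textbook identity, not specific to this project):

```latex
P(D \mid E) = \frac{P(E \mid D)\,P(D)}{P(E \mid D)\,P(D) + P(E \mid \neg D)\,\bigl(1 - P(D)\bigr)}
```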

Medical errors due to cognitive biases occur in 1.7-6.5% of all hospital admissions, causing up to 100,000 unnecessary deaths and perhaps one million excess injuries each year in the USA. In 2008, the total cost was US$19.5 billion; the average incremental cost per error was about US$4,685, with an increased length of stay of about 4.6 days.

Addressing biases that arise in data interpretation, feature selection, model evaluation, and deployment is therefore crucial. While previous work has focused on addressing biases in medical large language models, the proposed approach aims to design a personalized model for bias identification and suggested corrective action in doctor-patient interactions.

The methodology involves training an NLP model for bias identification on a manually annotated MedQA dataset of physician-patient encounters. Disease- and bias-specific corrective actions are then implemented, followed by training a neural network for Bayesian updating and probability-map generation.

Hyperparameters of the ML model are tuned against desirable thresholds, such as minimizing type 2 errors in diagnosis and increasing model sensitivity.


CT Scan-based Injury Detection and Classification System

Anagha Vasista, Ona Dubey, Ronit Kadakia

CODE PDF

Our project addresses the vital need to improve how brain injuries are diagnosed from CT scans, where accuracy is crucial. Traditional studies have focused on detecting single types of brain injuries, but our work expands this by using a deep-learning method called a convolutional neural network (CNN) to identify six different kinds of brain injuries, including haemorrhages and hematomas.

To achieve this, we used the CQ-500 dataset from Qure.ai, which includes 491 CT scans with 193,317 slices. This large dataset helped us extract features effectively and overcome common biases found in medical imaging datasets through image augmentation techniques. We also introduced a weighting system based on clinical experience to help interpret the scans more accurately.

For our model, we chose a modified version of the ResNet-18 architecture, which is effective at processing complex features from medical images. This model was adjusted to handle single-channel (grayscale) images typical in CT scans and includes 17 convolutional layers and 1 fully connected layer. We fine-tuned the first layer to better capture the details in medical images.
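
A hedged PyTorch sketch of those two modifications (the weight initialization and training details are omitted; the six-class output follows the description):

```python
# Adapt ResNet-18 to single-channel CT slices with a six-way injury head.
import torch
import torch.nn as nn
from torchvision import models

model = models.resnet18(weights=None)            # 17 conv + 1 FC layer
# First layer: accept 1 greyscale channel instead of 3 RGB channels.
model.conv1 = nn.Conv2d(1, 64, kernel_size=7, stride=2, padding=3, bias=False)
# Classification head: six brain-injury classes.
model.fc = nn.Linear(model.fc.in_features, 6)

logits = model(torch.randn(2, 1, 224, 224))      # dummy batch of CT slices
```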

The results show that our model performs well. It achieved an average balanced accuracy of 94.4%, demonstrating strong performance in identifying all types of brain injuries. It also showed high sensitivity and specificity, meaning it can accurately detect when injuries are present or absent. The model's overall accuracy on the test set was 89.56%, and it had an F1 score of 0.932.

In summary, our project significantly improves how brain injuries are diagnosed from CT scans. By using advanced deep learning techniques and a comprehensive dataset, our model reduces errors and increases the consistency of diagnoses, ultimately helping to improve patient care. This approach represents a meaningful step forward in using technology to enhance healthcare.


Deception Detection Using Non-intrusive Methods

Chaitanya Modi, Hibah Muhammad, Subham Jalan

CODE PDF

Deception is a ubiquitous challenge in society, with pervasive consequences in security-sensitive environments ranging from law enforcement to job interviews. The staggering frequency of lies underscores the urgent need for automated methods to detect deception effectively. Previous studies have explored physiological cues (Anna et al., 2006) but fall short in privacy preservation.

Our approach stands out by focusing on techniques from computer vision, NLP, and audio analysis. We used the Bag of Lies dataset (procured from IIIT-Delhi). Preprocessing involved techniques tailored to each data type: audio underwent noise removal and feature extraction (MFCC, pitch, zero-crossing rate, etc.); video was face-centered, and facial features (landmarks, action units, gaze data, expressions) were extracted; and text underwent preprocessing steps (tag removal, case matching, stop-word removal, etc.).

For the ML component, we employed a combination of algorithms including neural networks, SVMs, Bayes classifiers, and random forest classifiers, with architectures specified for each modality. Using leave-one-out cross-validation accuracy, our models achieved competitive results: text classification reached 69.7%, audio 61.23%, and the gaze and video modalities around 66.14% and 58.01%, respectively. These methods were chosen after a literature review, trial and error, and study of class content, with compute capability in mind. Our models' competitive accuracy and fast inference offer a positive contribution to the field of automated deception detection and its societal applications.


Drone-based Crowd Surveillance

Chanakya Rao, Moksh Soni, Vaibhav Chopra

CODE PDF

DroneCrowd addresses the critical need for enhanced public safety by developing advanced surveillance techniques to identify potential threats within crowded environments. Prior studies have been conducted to identify drones in the sky and persons in distress in the wilderness; the most relevant one would be a study that tried to map human postures to determine, from a drone feed, whether a particular individual’s actions were violent or not. The focus of our study is an aggregation of thousands of such individuals: a mob. We’re trying to determine mob movement and its level of violence.

The drone initially flies at a high altitude and tracks crowd movement on the ground. Once it detects that the crowd is dense, it lowers its altitude and hovers over the people to monitor them and detect violence.

We have utilized a variety of pre-existing datasets, the prominent ones being the 'Violent Flows database' and 'VisDrone DroneCrowd'. Our methodology involved pre-processing the video data and then employing CNNs; we created our own CrowdDensityNet and ViolenceNet. The choice of CNNs was motivated by their proven efficacy in handling spatial data.

Law enforcement organizations can use our high-accuracy models to allocate manpower and resources efficiently.


Generating Meaningful Description of Your Surroundings Using Only Images

Govind, Prashant Mishra, Rahul Kumar

CODE PDF

According to the World Health Organization, there are approximately 40 million people globally who are blind, with an additional 250 million experiencing some form of visual impairment. Our project, LucidLens, aims to significantly enhance the independence of visually impaired individuals through innovative technology. Our solution interprets the surroundings, converts them into meaningful sentences, and further converts those sentences to speech, enabling blind users to comprehend their environment. While there are similar initiatives, such as one focusing on indoor navigation developed by students at Assam Engineering College, our approach distinguishes itself by providing a comprehensive description of the scene rather than merely detecting objects.

Using the Flickr30k dataset of 31,783 images and 158,915 corresponding captions, we cleaned and preprocessed the data: establishing a mapping between captions and image names, converting text to lowercase, removing special characters and punctuation, and tokenizing. For image feature extraction, we employed the VGG16 model, renowned for its 16-layer architecture. Our model uses a multimodal architecture combining image and text inputs: a dropout-regularized dense layer processes image features, and an LSTM layer analyzes text sequences. The decoder then merges these representations and employs dense layers for prediction, trained with categorical cross-entropy loss and optimized using the Adam optimizer.
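
A condensed sketch of that architecture (the vocabulary size, sequence length, and layer widths are illustrative assumptions):

```python
# Multimodal captioner: VGG16 image features + LSTM over caption tokens,
# merged to predict the next word of the caption.
import tensorflow as tf
from tensorflow.keras import layers

vocab, maxlen = 8000, 35                        # assumed sizes
img_in = layers.Input(shape=(4096,))            # VGG16 fc2 features
img_h = layers.Dense(256, activation="relu")(layers.Dropout(0.4)(img_in))

txt_in = layers.Input(shape=(maxlen,))
txt_h = layers.Embedding(vocab, 256, mask_zero=True)(txt_in)
txt_h = layers.LSTM(256)(layers.Dropout(0.4)(txt_h))

h = layers.Dense(256, activation="relu")(layers.add([img_h, txt_h]))
out = layers.Dense(vocab, activation="softmax")(h)   # next-word distribution

model = tf.keras.Model([img_in, txt_in], out)
model.compile(optimizer="adam", loss="categorical_crossentropy")
```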

We evaluated our model's performance using the BLEU (Bilingual Evaluation Understudy) score, which assesses the quality of generated text. After 30 epochs, our model achieved a BLEU-1 score of 0.4615. While satisfactory, further enhancements are possible through hyperparameter tuning and more advanced model architectures.


Harmful Brain Activity Classification

Abhinav Lodha, Jiya Agrawal, Nandini Prakash, Pranjal Rastogi

CODE PDF

Detecting harmful brain activity with EEGs currently relies solely on slow, error-prone manual analysis by specialized neurologists. Previous studies in time-series classification have predominantly focused on 1D signal analysis, with limited exploration of convolutional neural networks (CNNs).

Our technique is a WaveNet model that takes 1D EEG signals as input. It is a deep convolutional neural network whose principal feature is dilation, which enlarges its receptive field and enables it to capture long-term dependencies.
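
A minimal sketch of the dilation idea (channel counts and depth are illustrative, not the project's configuration):

```python
# Stacked 1D convolutions whose dilation doubles each layer, exponentially
# growing the receptive field over the EEG signal; residual connections
# help preserve the input as depth grows.
import torch
import torch.nn as nn

class DilatedStack(nn.Module):
    def __init__(self, channels=32, n_layers=6):
        super().__init__()
        self.layers = nn.ModuleList([
            nn.Conv1d(channels, channels, kernel_size=3,
                      dilation=2 ** i, padding=2 ** i)   # keeps length fixed
            for i in range(n_layers)
        ])

    def forward(self, x):                  # x: (batch, channels, time)
        for conv in self.layers:
            x = torch.relu(conv(x)) + x    # residual connection
        return x

eeg = torch.randn(4, 32, 2000)             # 4 EEG windows, 2000 samples each
out = DilatedStack()(eeg)
```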

As it is a Kaggle competition, the dataset was provided by the hosts. It was acquired from the Critical Care EEG Monitoring Research Consortium (CCEMRC) and annotated by experts from institutions such as the Massachusetts General Hospital/Harvard Medical School Department of Neurology. Pre-processing involved high-pass filtering and Independent Component Analysis (ICA) for noise reduction, normalization to aid model convergence, and dimensionality reduction using the bipolar double-banana montage. The chosen algorithms, particularly ICA for its noise-cleaning efficacy and the montage for dimensionality reduction, were driven by their effectiveness on EEG data. High-pass filtering was set at 70 Hz, with ICA parameters including a kurtosis threshold of 2 and a correlation threshold of 0.8. The model's performance, guided by the Kullback-Leibler (KL) divergence loss, demonstrated promising results, with a significant decrease in loss across epochs.

These findings highlight the potential of our approach for accurately classifying harmful brain activity. Our model achieved a KL divergence loss of 0.4396; the scores of the top 10 on the leaderboard lie between 0.27 and 0.28.


Home Credit - Credit Risk Model Stability

Krishna Sharma, Manan Chawla, Sudershan Singh Negi

CODE PDF

This project aims to develop an effective predictive model for identifying customers at risk of defaulting on credit obligations using the Home Credit - Credit Risk Model Stability dataset from Kaggle. We preprocess the dataset comprehensively, including handling date features and treating numeric, categorical, and date-time features separately.

Given the dataset's imbalance, with more non-defaulters than defaulters, we evaluate the model using precision and recall metrics, more suitable for imbalanced data than accuracy. The dataset comprises three levels: depth 0 contains single-row entries (e.g., number of children), while depths 1 and 2 have multiple rows per case_id, representing past transactions. We use appropriate aggregation techniques to handle these structures.
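
A small pandas sketch of that aggregation pattern (the column names are hypothetical):

```python
# Depth-1/2 tables have many rows per case_id, so we collapse them to one
# row of summary statistics before joining onto the depth-0 base table.
import pandas as pd

base = pd.DataFrame({"case_id": [1, 2], "num_children": [0, 2]})
past = pd.DataFrame({"case_id": [1, 1, 2],
                     "payment": [120.0, 80.0, 300.0]})

agg = past.groupby("case_id")["payment"].agg(["mean", "max", "count"])
agg = agg.add_prefix("payment_").reset_index()   # one row per case_id

features = base.merge(agg, on="case_id", how="left")
```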

We employ various machine learning algorithms, including LightGBM and CatBoost, following a rigorous model selection process. This involves splitting the data into train and validation sets and assessing performance across relevant metrics. A voting classifier averaging model predictions is utilized to finally predict whether a person will default or not.


Humming-based Song Identification

Arya Lamba, Shruti Laddha, Utkarsh Agarwal

CODE PDF

We often find ourselves able to recall only fragments of a tune, unable to pinpoint the exact song. Our project aims to identify such songs, when you hum or whistle their tunes.

Past studies have not had the best results: dynamic time warping and other time-series methods are slow, while CNN-based approaches like Google's Hum to Search, although accurate, need significant computation and can only predict songs seen in training.

To overcome these challenges, we combine clustering, a CNN, and a Siamese network. We cluster the original songs using the K-means algorithm, then use the CNN to assign hums/whistles to specific clusters. The Siamese network accepts pairs as inputs and outputs similarity scores: the CNN provides accurate cluster assignment, while the Siamese network learns to measure similarity within a cluster.
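
A minimal sketch of the Siamese scoring idea (the encoder and feature sizes are illustrative assumptions):

```python
# A shared encoder embeds a hum and a candidate song; similarity is the
# cosine between embeddings, trained so matching pairs score high.
import torch
import torch.nn as nn
import torch.nn.functional as F

class Encoder(nn.Module):
    def __init__(self, n_features=40, dim=64):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(n_features, 128), nn.ReLU(),
                                 nn.Linear(128, dim))

    def forward(self, x):
        return F.normalize(self.net(x), dim=-1)   # unit-length embedding

encoder = Encoder()
hum, song = torch.randn(8, 40), torch.randn(8, 40)        # batch of pairs
similarity = (encoder(hum) * encoder(song)).sum(dim=-1)   # cosine scores
```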

Our dataset consists of 6,600 humming samples from Kaggle and 50 songs of different genres from YouTube. We then collected more data and augmented it with SPICE to create 100 additional humming samples.

We used SoX to remove low-intensity background noise and Librosa to extract features of the hums such as MFCCs, Mel spectrograms, and spectral and chroma features, along with some other pre-processing steps.

The CNN-Siamese approach is apt: the CNN proved robust at classification with a balanced accuracy of 0.89, while the Siamese network had a balanced accuracy of 0.86. The MRR is 0.62 for a database of 25 songs and 0.88 for a database of 7 songs. Using 2 clusters helps make the application computationally lighter and more scalable.


Improving Position Accuracy using Raw GNSS Data - Google Smartphone Decimetre Challenge

Anshuman Sharma, Liza Wahi, Zoya Ghoshal

CODE PDF

Accurate GNSS data is essential for applications like navigation, yet it is often compromised by environmental and technical disturbances. Previous methods, focusing on filtering and corrections, have not fully addressed the nonlinearities and integration of multi-source data. This project adopts a deep learning approach to enhance the precision of GNSS positioning predictions.

Our dataset, from the Google Decimeter Challenge, includes GNSS logs and ground-truth data from multiple trips and devices. Preprocessing involved signal filtering, normalisation, and feature engineering to improve data quality for model input. Our strategy combined independent predictions from two modelling approaches, Real-Time Kinematic processing (RTKLIB) and a Long Short-Term Memory (LSTM) model, according to their accuracy. The LSTM comprised two layers with 64 and 32 units, designed to capture temporal dependencies in GNSS sequences. We evaluated the LSTM outputs and, on that basis, selectively replaced predictions with the RTKLIB solution.

The scoring metric is the mean of the 50th and 95th percentile distance errors between predicted and actual latitude/longitude coordinates for each phone model; lower scores indicate higher accuracy and better solution performance. Our solution evaluated to 1.807 m.


Kelp Segmentation in Multispectral Images

Jia Bhargava, Pratham Arora, Tanay Srinivasa

CODE PDF

Our project addresses the critical need for monitoring and preserving kelp forests, essential ecosystems under threat from climate change and human activity. Giant kelp contributes to over $500B in economic activity annually, and kelp and other ocean algae generate 50-85% of the Earth's oxygen. However, monitoring these habitats is difficult because they change rapidly in response to temperature and wave disturbances. Leveraging satellite imagery, we propose an approach that uses remote sensing indices and convolutional neural networks to boost classification accuracy.

Our dataset, obtained from a DrivenData competition, comprises 5,635 multispectral images from Landsat satellites. We preprocess images by extracting 6 remote sensing indices based on previous research in this field: NDVI, NDWI, MNDWI, CI, ARVI, and SVI. Because the dataset is noisy, we developed a streak-detection CNN to eliminate images with artifacts, achieving an F1 score of 0.952. Each image is then passed through a kelp detection model, a voting classifier over the predictions of ResNet-18, ResNet-34, ResNet-50, and Inception, which achieves an F1 score of 0.84. We also leverage transfer learning for kelp segmentation with a U-Net model pre-trained on the ImageNet dataset using an EfficientNet-B3 encoder; it takes an MNDWI feature image as input and predicts where kelp is present in the image. The segmentation model achieves a mean Dice coefficient of 0.553.
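
For illustration, two of the listed indices computed from raw bands (the band arrays and the epsilon guard are our assumptions; the formulas are the standard definitions):

```python
# NDVI = (NIR - Red) / (NIR + Red); MNDWI = (Green - SWIR) / (Green + SWIR).
# A small epsilon avoids division by zero over dark pixels.
import numpy as np

def ndvi(nir, red, eps=1e-6):
    return (nir - red) / (nir + red + eps)

def mndwi(green, swir, eps=1e-6):
    return (green - swir) / (green + swir + eps)

nir, red = np.random.rand(350, 350), np.random.rand(350, 350)    # dummy bands
green, swir = np.random.rand(350, 350), np.random.rand(350, 350)
index_stack = np.stack([ndvi(nir, red), mndwi(green, swir)])     # extra channels
```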


Mining the Mines Using Data Mining

Bilawal Singh Deu, Dhirain Vij, Sakarth Brar

CODE PDF

Our project aims to build a model that detects mining sites in satellite images, addressing the inability to properly monitor mine development and its associated risks to the population and the environment, which often results from a lack of resources in developing countries.

Our first model utilizes a dataset of 1,242 labelled Sentinel-2 satellite image tiles covering mining areas across multiple countries. Pre-processing steps include data augmentation and feature engineering to derive spectral indices like ENDMI (Enhanced Normalized Difference Mine Index) and NDVI (Normalized Difference Vegetation Index) from the original multispectral bands.

The machine learning methodology employs a Vision Transformer (ViT) architecture, pre-trained on ImageNet, with modifications to accommodate the 14-channel input (12 original bands plus ENDMI and NDVI). The model is fine-tuned using a cross-entropy loss function with label smoothing and weighted class balancing to handle the imbalanced distribution of mining and non-mining sites. To mitigate overfitting and enhance generalization, strategies like k-fold cross-validation (k=5) are implemented. The evaluation metric is the F1 score. The current best model achieves an F1 score of 0.94 on the held-out validation set.
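
A hedged sketch of the channel adaptation using timm (the specific ViT variant is an assumption; class weights are omitted for brevity):

```python
# Adapt a pre-trained ViT to 14-channel tiles (12 Sentinel-2 bands plus
# ENDMI and NDVI); timm re-initializes the patch embedding for the new
# channel count. Label smoothing mirrors the loss described above.
import timm
import torch

model = timm.create_model("vit_base_patch16_224", pretrained=True,
                          in_chans=14, num_classes=2)
loss_fn = torch.nn.CrossEntropyLoss(label_smoothing=0.1)  # assumed smoothing
logits = model(torch.randn(1, 14, 224, 224))   # mining vs. non-mining tile
```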

For our segmentation model, we use a dataset comprising 1,790 polygons within India of features such as waste-rock dumps, pits, water ponds, tailings dams, heap-leach pads, and processing/milling infrastructure. Pre-processing steps include extraction of Sentinel-2 images from Google Earth Engine, preparation of masks, resizing, augmentation, and normalisation. The ML methodology includes a Feature Pyramid Network architecture with a ResNet-34 encoder for semantic segmentation of binary mine-vs-background images, a specialized loss function (Dice loss) configured for binary segmentation tasks, and the Adam optimization algorithm with a fixed learning rate of 0.0001. This model currently yields an IoU score of 0.54.


Movie Plot-based Genre Classification

Anshika Singh, Harsh Mishra, Rohit Singh

CODE PDF

The project aims to develop a multi-label genre classification system for movies. The need for this project arises from the growing demand for personalized content recommendation systems in streaming platforms. By accurately predicting multiple genres associated with a movie, we can enhance user experience by providing more relevant recommendations.

While existing approaches have successfully utilized techniques like TF-IDF and Word2Vec for feature extraction, they may not fully capture the intricate interplay of themes and plot twists inherent in movie narratives. Furthermore, current models may struggle with handling overlapping genre classifications, where a movie could belong to multiple genres simultaneously.

After rigorous data preprocessing with the NLTK library (stop-word removal, lemmatization) and word-cloud inspection, we used TF-IDF and Count Vectorizer for feature extraction. We employed Binary Relevance to transform the multi-label classification problem into independent single-label problems, alongside classifiers such as Linear SVC, Multinomial Naïve Bayes, and Logistic Regression, aiming for high accuracy and interpretability.

We made 4 models by creating pipelines of a text encoder (TF-IDF or Count Vectorizer), a classifier (SVC, MNB, or LR), and a problem transformer (Binary Relevance). Precision, recall, and F1 score were used as performance indicators. The best combination was TF-IDF with Binary Relevance and SVC, with a weighted F1 score of 0.76. We aim to further improve the model's capability to accurately discern and assign multiple labels to such complex cases, significantly enhancing its predictive accuracy and utility.
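
A minimal sketch of that best pipeline (TF-IDF + Binary Relevance + Linear SVC), with toy plots and labels as placeholders:

```python
# Binary Relevance trains one binary classifier per genre, so a movie can
# receive several genre labels at once.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC
from skmultilearn.problem_transform import BinaryRelevance

plots = ["a detective hunts a serial killer",
         "two friends road-trip across the country and fall in love"]
labels = np.array([[1, 0], [0, 1]])            # e.g. [thriller, romance]

X = TfidfVectorizer(stop_words="english").fit_transform(plots)
clf = BinaryRelevance(classifier=LinearSVC())  # one binary SVC per genre
clf.fit(X, labels)
pred = clf.predict(X)                          # sparse multi-label matrix
```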


Multiclass Prediction of Obesity Risk

Abhivarya Kumar, Gaurav Agarwal, Vedant Singh

CODE PDF

Obesity, the 5th largest cause of death globally, can shorten our life by up to 10 years. Accurate prediction of obesity lies at the heart of preventive measures and early intervention. Leveraging Machine Learning, we aim to predict obesity risk using relevant factors like diet, family history, and meal frequency.

Our dataset, obtained from a previous Kaggle competition, provides comprehensive features like BMI, family history, physical activity, and dietary habits. For data pre-processing, we divided the data into numerical and categorical features and applied appropriate encoding techniques like binary and ordinal encoding. Further feature engineering added relevant features like BMI, calorie intake, and physical activity levels.

We used a variety of machine learning algorithms, including logistic regression, decision trees, random forests, support vector machines (SVMs), XGBoost, and LightGBM, chosen for their ability to handle multi-class classification problems, high-dimensional data, and non-linear relationships. We also plan to leverage a convolutional neural network (CNN) with hidden layers, designed to identify and classify various types of obesity, as a departure from traditional classifiers.

We achieved promising results from our models with the highest accuracy of 0.89 given by Random Forest and XGBoost, effectively capturing the complex relationships between lifestyle factors and obesity. Our project is quite significant as it is a comprehensive approach to obesity risk prediction and can help mitigate the physical and mental burdens associated with obesity.


NeuroTone.AI: Predicting Onset of Parkinson's Disease using Auditory Cues

Abhigyan Mehrotra, Pratik Rana, Prerit Rathi

CODE PDF

Parkinson's disease (PD) is a neurodegenerative disorder affecting over 10 million people globally. Voice impairment is one of the earliest symptoms in around 90% of PD patients. Detecting PD from speech data can enable timely diagnosis and treatment. While previous studies explored machine learning models on English speech datasets, there is a need for robust multilingual models given the global prevalence of PD across diverse populations.

This study aims to develop machine learning methods for detecting PD from multilingual speech recordings in English and Italian. Current ML models often fail to generalize across multiple languages as they are typically single-language models. Speech samples were collected from PD patients and healthy controls across both languages. Audio data was converted to CSV format, extracting voice features like MFCCs, pitch, amplitude, and noise measures. Data pre-processing included normalization, standardization, exploratory data analysis, and linguistic labelling.

Multiple machine learning algorithms were evaluated, including Random Forests, LightGBM, support vector machines (SVMs), KNN, CatBoost, Extra Trees, and neural networks. These algorithms were selected for their ability to effectively handle linear, non-linear, numeric, and textual data while being robust to noise and outliers. We achieved close to 85% accuracy using the Extra Trees classifier, with an AUC of 0.93.

Stratified cross-validation is used to assess model performance using metrics such as accuracy, the area under the ROC curve (AUC), precision, recall, and F1 score across both languages. Ensemble techniques combining the strengths of these models may also be explored.

The resulting multilingual system aims to detect PD from voice data across diverse linguistic populations accurately. Identifying the most discriminative voice features and linguistic insights could inform future research into vocal biomarkers of PD. Ultimately, this could enable cost-effective screening and monitoring of at-risk global populations.


Optimising Fantasy Cricket Team Selection using Machine Learning

Hemant Gupta, Madhvendra Singh, Samarth Anand

CODE PDF

This project addresses the challenge of predicting the best fantasy cricket team compositions for the Indian Premier League (IPL) using historical player performance data. Previous studies have employed various machine learning approaches for team selection, such as Random Forest, XGBoost, and ExtraTrees Regressor. Our approach differs by integrating a truth-value algorithm that accounts not only for the players' various features but also for the DREAM11 scores of all players across the history of the IPL, i.e., from 2008 to 2024.

We collected ball-by-ball data for all IPL matches from cricsheet.org, an open-source repository of data for international and league cricket matches, ensuring a robust foundation for analysis. Besides cleaning the data, we employed Principal Component Analysis to reduce dimensionality and tackle multicollinearity among variables.

We perform logistic regression and time-series prediction using an LSTM. Our current log loss with logistic regression is 0.68, and our LSTM accuracy is 50.09%. We use 'error rate' as our performance metric: a comparison between the team our algorithm predicts and the Dream Team released by DREAM11 after every match.


Pneumothorax Segmentation of Chest Radiographs

Amog Rao, Ananya Shukla, Nikhil Henry

CODE PDF

Pneumothorax, also called a collapsed lung, is an acute pathology associated with the presence of air in the pleural cavity between the lung and the chest wall, caused by traumatic or spontaneous etiologies that may be life-threatening. Manual detection of pneumothorax from chest radiographs, followed by visual segmentation of a Region of Interest (RoI), can lead to delayed or inaccurate diagnosis. Previous studies leverage a multi-input, multi-output architecture that fuses frontal and lateral X-rays and incorporates residual blocks, achieving image-level annotation AUCs of 0.932 and 0.923 with ResNet-50 and DenseNet-121 [Luo, 2022].

Our implementation is a single-instance model, optimised for low compute to decrease the cost of inference and training. We use non-destructive augmentation techniques to preserve the feature set while tackling class imbalance, maintaining a balanced class distribution in training and the population distribution in validation.

The proposed methodology is based on transfer learning, utilizing pre-trained backbones (on ImageNet) as encoders within a UNet encoder-decoder architecture for binary semantic segmentation. EfficientNet-B4 was selected as the encoder for this task because it makes the desired trade-off between parameter count (17M) and accuracy. Additionally, a combo weighted loss of Dice and binary cross-entropy was used, with lambda values of 1 and 0.5 respectively. This led to convergence at a best IoU score of 0.8044 within 42 epochs on the training set. Moreover, both the CosineAnnealingWarmRestarts and ReduceLROnPlateau schedulers from PyTorch, with varying patience, were used to arrive at a possibly deeper local minimum of the loss function.
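
A hedged sketch of that combo loss in PyTorch (the smoothing constant is an assumption; the lambda weights follow the description):

```python
# Weighted sum of soft Dice loss and BCE-with-logits over predicted masks.
import torch

def combo_loss(logits, target, w_dice=1.0, w_bce=0.5, smooth=1.0):
    bce = torch.nn.functional.binary_cross_entropy_with_logits(logits, target)
    probs = torch.sigmoid(logits)
    inter = (probs * target).sum()
    dice = 1 - (2 * inter + smooth) / (probs.sum() + target.sum() + smooth)
    return w_dice * dice + w_bce * bce

logits = torch.randn(2, 1, 256, 256)                  # raw UNet outputs
masks = torch.randint(0, 2, (2, 1, 256, 256)).float() # ground-truth masks
loss = combo_loss(logits, masks)
```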

To evaluate our segmentation networks, we first use Intersection over Union (IoU): the area of overlap between the ground truth (Ptrue) and the predicted segmentation (Ppredicted) divided by the area of their union. We also use accuracy, recall, precision, and the Dice coefficient (equivalent to the F-measure/F1 score), which measures pixel-wise agreement between a predicted segmentation and its corresponding ground truth, along with Equal Opportunity Difference (EOD) to gauge bias with respect to prevalence, chest tubes, and the AP/PA ratio.


Predicting Dengue Spread Using Local Geospatial and Weather Features

Bharat Jain, Soorya KS, Usman Akinyemi

CODE PDF

Dengue fever, a rapidly spreading global health threat, affects millions of individuals annually, leading to significant morbidity and mortality worldwide. With up to 400 million infections and 40,000 deaths reported each year, the impact of dengue fever necessitates innovative solutions to combat its spread. Our project aims to address the need for accurate prediction of dengue fever outbreaks to enhance healthcare planning and resource allocation. Previous studies have utilized machine learning algorithms such as Support Vector Machine, Decision Tree, k-means, Naive Bayes, and Random Forest for dengue prediction. However, this project stands out by focusing on developing a model that predicts the total number of dengue cases per week in San Juan and Iquitos using a dataset with key features like weekofyear, temperature, precipitation and various NDVI values.

The dataset was collected from the two cities and underwent feature preprocessing, including the addition of lag features. We initially used Principal Component Analysis (PCA) for dimensionality reduction, which showed that three components explain 95% of the dataset's variability, but we later dropped PCA after obtaining better performance with all features. For the machine learning component, a diverse set of algorithms was employed, including Extra Trees Regressor, Light Gradient Boosting Machine, Random Forest Regressor, Support Vector Regression, and Support Vector Machine, selected for their proven effectiveness on regression tasks and their fit to the dataset's characteristics. Evaluation metrics such as Mean Absolute Error (MAE), Mean Squared Error (MSE), and 10-fold cross-validation were used to assess model performance.
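
A small sketch of the lag-feature step in pandas (the column names are hypothetical):

```python
# Shift each city's weekly case counts so the model sees recent history.
import pandas as pd

df = pd.DataFrame({"city": ["sj"] * 5, "total_cases": [4, 5, 4, 3, 6]})
for lag in (1, 2, 3):
    df[f"cases_lag_{lag}"] = df.groupby("city")["total_cases"].shift(lag)
df = df.dropna()   # the first rows lack full lag history
```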


Pronunciation Feedback System

Devansh Upadhyaya, Vardan Vij, Yashvi Maheshwari

CODE PDF

Our project addresses the crucial need for improved pronunciation tools in language learning, an area often overlooked by conventional educational resources. Despite existing studies focusing on speech recognition, our system uniquely integrates real-time feedback using advanced machine learning to enhance pronunciation skills effectively.

We leveraged the Mozilla Common Voice dataset, comprising over 100,000 audio samples from native speakers, to train our model, ensuring exposure to diverse linguistic nuances. Employing Recurrent Neural Networks (RNNs) for processing and analyzing speech, we chose these algorithms for their proficiency in handling sequential data, which is paramount in speech recognition. The RNN architecture is fine-tuned with multiple layers to effectively learn and predict the subtleties of spoken language.

Our preliminary analysis shows promising results, indicating that the system not only accurately assesses pronunciation but also offers detailed feedback on fluency, accuracy, and completeness. This immediate, precise feedback is designed to be intuitive, significantly boosting learners' confidence and accelerating their progress in mastering a new language. The significance of our work lies in its potential to transform traditional language learning by filling a critical gap and providing a personalised educational tool that addresses the specific challenges of pronunciation training.


Reimagining Playing Blackjack Using RL

Dipit Golechha, Harsh Siroya, Krishnav Mahansaria

CODE PDF

This project explores the application of the Double Q-learning algorithm to optimize blackjack strategy. By implementing and comparing 125 combinations of hyperparameters (gamma, alpha, epsilon), we aim to address the overestimation bias inherent in traditional Q-learning methods. Our approach involved iterative training over 20,000,000 episodes, followed by evaluation against baseline and random strategies. The performance metrics—wins, losses, cumulative winnings, and win percentage—were visualized to highlight the effectiveness of our strategy.
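
A minimal sketch of the tabular Double Q-learning update at the core of this approach (the state encoding and hyperparameter values are placeholders from the searched grid):

```python
# Two Q-tables: one selects the best next action, the other evaluates it,
# which reduces the overestimation bias of single-table Q-learning. States
# and rewards would come from a blackjack environment such as Gymnasium's
# Blackjack-v1 (an assumed choice, not necessarily the project's).
import random
from collections import defaultdict

q1 = defaultdict(float)
q2 = defaultdict(float)
ACTIONS = [0, 1]                       # stick, hit

def update(s, a, r, s_next, done, alpha=0.1, gamma=1.0):
    # Flip a coin: update one table using the other's value estimate.
    qa, qb = (q1, q2) if random.random() < 0.5 else (q2, q1)
    if done:
        target = r
    else:
        best = max(ACTIONS, key=lambda a2: qa[(s_next, a2)])  # qa selects
        target = r + gamma * qb[(s_next, best)]               # qb evaluates
    qa[(s, a)] += alpha * (target - qa[(s, a)])
```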

Results indicate that specific hyperparameter combinations significantly enhance performance, achieving win percentages above the optimal strategy. The Double Q-learning algorithm demonstrated its potential to refine decision-making in uncertain environments, suggesting broader applicability in strategic AI deployment. This work contributes to the understanding of reinforcement learning applications in gaming, offering insights into more stable and reliable learning outcomes in blackjack.

In comparison to basic strategy (the optimal strategy):
Wins Comparison: Our strategy consistently achieved higher wins compared to the baseline and random strategies.
Losses Comparison: Our strategy maintained lower losses than the baseline and random strategies.
Cumulative Winnings Comparison: Our strategy outperformed in cumulative winnings.
Percentage Comparison: Our strategy demonstrated a higher win percentage.


SCAN: Summarized Context Aware News

Niharika Gupta, Shivangi Agarwal, Tanmay Nanda

CODE PDF

As digital news volumes increase, the demand for concise, context-sensitive summaries grows more pressing. Our project meets this challenge by automating the generation of dependable summaries and conducting comprehensive sentiment analysis to improve the accessibility of online news and mitigate bias.

We initially used the labelled Global News Dataset to assign sentiment labels to titles in the unlabeled BBC News Dataset via a semi-supervised method. However, due to its limited accuracy, this technique was not included in our final model design. Instead, we utilised a CNN with three convolutional layers, achieving an accuracy and F1 score of 83%.

For our abstractive text summarizer, we created target summaries using Google's Gemini model and implemented two models for generating summaries: an encoder-decoder framework with LSTM that utilises a teacher forcing mechanism and a BART transformer model enhanced with transfer learning. We assessed the effectiveness of our models using ROUGE scores and the G-Eval metric, both of which indicated strong performance. We achieved a ROUGE-1 score of 0.48, a ROUGE-2 score of 0.23, a ROUGE-L score of 0.38, and an overall G-Eval score of 3.5 out of 5, demonstrating the robustness of our approach.


Sentiment Analysis of RBI Monetary Policy Committee Meetings with NLP

Abhiraaj Sharma, Ayush Bhatnagar, Tanushi Khandelwal

CODE PDF

Central bank communications, especially monetary policy statements, influence financial markets by shaping expectations around policy stance and rationale. Previous studies have employed traditional natural language processing techniques like bag-of-words and dictionary-based sentiment analysis. However, these methods lack the ability to capture nuanced context and complexity of central bank language. Our project leverages advanced deep learning models like LSTMs with attention mechanisms to provide comprehensive sentiment analysis of Reserve Bank of India (RBI) monetary policy statements and predict their impact on the NIFTY50 index.

We created a dataset by extracting text from RBI policy statements ranging from 2000-2024 and preprocessing it through lemmatization and removing stopwords. For feature extraction, we used dictionary-based sentiment scores, readability indices like FJP, and contextualized word embeddings from doc2vec models. The LSTM architecture incorporated attention layers and was trained on document-level sentiment labels derived from NIFTY50 weekly delta values after detrending for benchmark returns. Our approach achieved promising results with a precision of 0.67, recall of 0.67 and balanced accuracy of 0.64 on the test set. By quantifying sentiments in RBI communications, this analysis can provide valuable insights to market participants on the central bank's policy stance, enhancing investment decision making capabilities.


Song.ly - Music Genre and Sentiment Analyser

Manjree Kothari, Roma Sahu, Vandita Lodha

CODE PDF

Music plays a vital role in expressing our emotions in today's digital age. Yet, understanding the feelings and genres of songs can be tricky. That's why we're developing a machine learning model to determine music's genre and sentiment. We aim to help people better understand themselves and enjoy and appreciate the music they listen to more.

Our dataset was acquired by web scraping and by using platforms such as the Genius API to extract song lyrics for sentiment analysis of English music. Through thorough EDA, we reduced our almost 30 features to the subset accounting for 95% of feature importance. Using a Random Forest classifier, we achieved an ROC-AUC score of >0.999 for genre identification and 0.947 for sentiment. We currently classify 10 genres (Pop, Rock, Classical, Country, Disco, Hip-hop, Jazz, Metal, R-n-b, and Bollywood) and 6 moods (Sad, Sleep, Chill, Happy, Romance, and Angry).

Our comparative analysis covered SVM and Random Forest classifiers with TF-IDF, NLTK, and word2vec features; the best results came from the Random Forest classifier with TF-IDF.


Store Sales and Time Series Forecasting

Jahnavi G. Shankar, Japsahaj Kaur, Rishav Kumar

CODE PDF

Our project is all about helping Corporación Favorita, a beloved grocery store chain in Ecuador, improve how they predict sales. By getting better at knowing what customers will buy and managing their inventory, we're helping them keep their shelves stocked and their business thriving. While previous studies often focused on univariate time-series data, our approach tackles the complexity of multivariate time-series data by applying machine learning techniques like linear regression, XGBoost, and LSTMs.

To achieve this, we meticulously collected and merged sales data with other pertinent information, such as dates, stores, items, and transactions, into comprehensive datasets for analysis. Employing techniques like lag-feature engineering and seasonal trend analysis, we preprocessed the data to extract meaningful insights and uncovered underlying patterns by analyzing trend, seasonality, and cycles in the data. Linear regression and an LSTM were applied to make the sales predictions.

Our chosen evaluation metric, RMSLE, validates the effectiveness of our methodology. The training phase yielded an RMSLE of 0.588 for the linear regression model. Through the integration of linear regression and XGBoost, we obtained a validation error of 0.0150, whereas our test RMSLE is 0.756 for the LSTM and 0.510 for linear regression. Through thorough data preprocessing and applied machine learning, our project plays a pivotal role in advancing sales prediction. Our overarching goal is to enhance retail operations by promoting ongoing refinement, ultimately bolstering the profitability of enterprises like Corporación Favorita in today's dynamic marketplace.
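
For reference, the RMSLE metric we report, in a few lines of NumPy (the sample arrays are placeholders):

```python
# Root Mean Squared Logarithmic Error: penalizes relative, not absolute,
# differences, so it suits sales counts spanning several magnitudes.
import numpy as np

def rmsle(y_true, y_pred):
    return np.sqrt(np.mean((np.log1p(y_pred) - np.log1p(y_true)) ** 2))

print(rmsle(np.array([10.0, 0.0, 35.0]), np.array([12.0, 1.0, 30.0])))
```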


VeriFact: Fake News Detection and Predicting its Degree of Factuality

Anish Sridharan, Purv Sogani, Swarup Patil

CODE PDF

In a world where, thanks to AI, fake news can become nearly undetectable by humans, our team tackles this as a multi-class classification problem, returning one of the following labels: half-true, false, true, barely-true, or pants-fire.

We chose LIAR as our dataset: a publicly available collection of 12.8K manually labeled short statements in various contexts, gathered over a decade.

In deciding which algorithms fit the task, we eliminated several candidates. For example, we eliminated classical supervised machine learning models after finding they are outperformed by other models, and we avoided plain RNNs to prevent the exploding/vanishing gradient problem.

We eventually settled on a Bi-LSTM, which gave us 27% six-class accuracy, 5 points above the accuracy achieved by the paper's authors using the same method, and on par with the best test accuracy in the paper.


Weakly Supervised Word-level Pronunciation Error Detection and Feedback System for Local Languages

Arnav Rustagi, Nimrat Kaur, Satvik Bajpai

CODE PDF

This paper investigates the application of Computer-Assisted Pronunciation Training (CAPT) methods to Indian native languages, with a specific focus on Hindi. While extensive research exists on CAPT for English language learning, there is a dearth of literature exploring its application in this domain. We address this gap by leveraging the work done by researchers like Daniel Korzekwa.

We explore the Hindi dataset from the Multilingual and Code-Switching ASR Challenge, but a key challenge in CAPT systems lies in the scarcity of training data containing mispronounced speech. To tackle this problem, we produced accurate synthetic mispronunciations. Additionally, we propose a novel L2 (non-native speaker) Hindi speech dataset for fine-tuning purposes.

Our proposed system employs a weakly supervised encoder-decoder architecture. The encoder utilizes a Recurrent Convolutional Neural Network (RCNN) for feature extraction, while the decoder is a shared Attention-based Recurrent Neural Network (ARNN) to effectively model phoneme sequences. To mitigate overfitting, we implement a multi-task training approach encompassing two tasks: 1) Phoneme Recognition Network (PRN) and 2) Mispronunciation Detection Network.

We got the following metrics for the model:

Weighted F1 score: 82.24%
Accuracy: 76.32%
Precision: 89.94%
Recall: 75.48%

Furthermore, this paper introduces a novel method for providing targeted feedback to learners regarding their mispronounced speech. This contribution has the potential to significantly enhance the effectiveness of CAPT systems for Indian native languages.


WordWeaver: The ML Reading Revamp

Anirudh Chauhan, Ronak Mir, Vijeta Raghuvanshi

CODE PDF

Our model presents a comprehensive approach to evaluating reading proficiency using machine learning techniques applied to audio samples. Leveraging the extensive LibriSpeech ASR corpus, our model integrates multiple factors crucial for effective reading, including expression and volume, phrasing and intonation, and smoothness and pace. Employing a support vector machine (SVM) model, we achieved notable accuracies on real-world test data, with scores ranging from 66.40% to 76.52% across four proficiency levels. Corresponding F1 scores further validate the model's effectiveness, ranging from 55.72% to 71.23%. The large and diverse nature of the LibriSpeech corpus, comprising approximately 1,000 hours of read English speech, enables our model to generalize across accents, dialects, and reading styles. Overall, our approach offers a powerful tool to support reading development for learners of all ages and backgrounds, providing nuanced analyses and data-driven feedback to build confident and capable readers.