Blog posts

2026

Exploring UK Road Accidents: What 104K Collisions Tell You Before You Model

Published:

Before building any model, you need to understand the data well enough to make defensible modeling decisions. This post walks through how I approached exploratory data analysis on the UK Department for Transport’s 2023 road accident dataset — 104,258 collisions and 189,815 vehicle records. The EDA directly shaped the dual-model strategy I describe in my class imbalance post.

Tackling Extreme Class Imbalance: UK Road Accident Severity with LightGBM

Published:

When 76% of your labels belong to a single class and the rarest class sits at 1.4%, standard classifiers will happily predict the majority class every time and report impressive accuracy. This post walks through what I learned building a severity classifier on 104,258 UK road collisions from the Department for Transport’s 2023 STATS19 data, and why the numbers that looked great at first turned out to be completely wrong.
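The accuracy trap above can be sketched in a few lines. This is a minimal illustration, not the post's actual pipeline: the class names and the per-1,000 counts are assumptions chosen to match the quoted 76% and 1.4% shares, with the middle class taken as the remainder.

```python
from collections import Counter

# Hypothetical labels per 1,000 collisions (assumed names and counts):
# 76% majority class, 1.4% rarest class, remainder in the middle class.
labels = ["slight"] * 760 + ["serious"] * 226 + ["fatal"] * 14

# A "classifier" that always predicts the most common class.
majority_class, _ = Counter(labels).most_common(1)[0]
predictions = [majority_class] * len(labels)

# Plain accuracy rewards the majority-only strategy.
accuracy = sum(p == y for p, y in zip(predictions, labels)) / len(labels)


def recall(cls):
    """Fraction of true members of `cls` that were predicted as `cls`."""
    actual = [i for i, y in enumerate(labels) if y == cls]
    hits = sum(predictions[i] == cls for i in actual)
    return hits / len(actual)


# Macro-averaged recall weights every class equally, so it exposes the failure.
classes = sorted(set(labels))
macro_recall = sum(recall(c) for c in classes) / len(classes)

print(f"accuracy:     {accuracy:.3f}")
print(f"macro recall: {macro_recall:.3f}")
print(f"fatal recall: {recall('fatal'):.3f}")
```

Under these assumed counts the baseline scores 76% accuracy while never identifying a single case from the rarest class, which is why a per-class metric is the more honest starting point.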

LatexForLLM: Turning LaTeX Papers into Graphs for Smarter LLM Retrieval

Published:

If you have ever pasted an entire research paper into ChatGPT or Claude and watched your token budget evaporate, you know the problem. A typical 10-page paper burns 8,000-12,000 tokens, yet the model only needs a few hundred to answer most questions about it. I built LatexForLLM to fix this. It parses LaTeX documents into a typed graph and retrieves only the sections, equations, and figures that matter. On benchmark tasks against a realistic 200-line paper, graph-based retrieval cuts word count by ~54% on average (up to ~80% for focused queries) compared to pasting the full document.

2024

Leveraging Pandas to Interact with SQL

Published:

Most data work involves going back and forth between SQL databases and Python. You write a query to pull what you need, load it into a DataFrame, do your analysis, maybe write results back. Pandas has built-in support for this workflow, and once you set it up, you rarely need to leave Python to interact with your database.
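As a rough sketch of that round trip, the example below uses an in-memory SQLite database so it runs without any setup. The table names, columns, and values are made up for illustration; `pandas.read_sql_query` and `DataFrame.to_sql` are the pandas calls doing the work.

```python
import sqlite3

import pandas as pd

# In-memory SQLite database with a small, made-up table (illustrative only).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [("north", 120.0), ("south", 80.0), ("north", 30.0)],
)
conn.commit()

# 1. Pull what you need with a query, straight into a DataFrame.
df = pd.read_sql_query("SELECT region, amount FROM sales", conn)

# 2. Do the analysis in pandas.
totals = df.groupby("region", as_index=False)["amount"].sum()

# 3. Write the results back as a new table.
totals.to_sql("sales_totals", conn, index=False, if_exists="replace")

# Read the new table back to confirm the round trip.
check = pd.read_sql_query("SELECT * FROM sales_totals ORDER BY region", conn)
print(check)
```

For databases other than SQLite, pandas generally expects a SQLAlchemy connection or engine in place of the raw `sqlite3` connection used here.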

Stock analysis

Published:

In this notebook we will carry out a trend analysis of technology stocks.

2023

Time series analysis

Published:

We will discuss time series analysis using financial data. Techniques such as Moving Average (MA), Autoregressive (AR), and Autoregressive Integrated Moving Average (ARIMA) models will be discussed. To model the time series we will use the statsmodels library on data acquired through the Yahoo Finance API.

Portfolio Analysis

Published:

We create a portfolio of stocks from American markets, analyze their performance, and try to assess their future risk.