Latest Posts (20 found)
nathan.rs 3 weeks ago

The Short Life and Death of a Research Idea

For Cal Hacks 2025, a few friends and I built Curserve, a fast and scalable server-side engine for agentic coding, which ended up placing for one of the sponsor prizes. We didn’t go to Cal Hacks to try and win, but instead to have a good excuse to work on a potential research idea. It turns out that our idea was a much better hackathon project than it was a research direction!

nathan.rs 1 month ago

BERT is just a Single Text Diffusion Step

A while back, Google DeepMind unveiled Gemini Diffusion, an experimental language model that generates text using diffusion. Unlike traditional GPT-style models that generate one word at a time, Gemini Diffusion creates whole blocks of text by refining random noise step-by-step. I read the paper Large Language Diffusion Models and was surprised to find that discrete language diffusion is just a generalization of masked language modeling (MLM), something we’ve been doing since 2018. The first thought I had was, “Can we finetune a BERT-like model to do text generation?” I decided to try a quick proof of concept out of curiosity.
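The generation loop this describes — start fully masked, then reveal a few tokens per refinement step — can be sketched in Python. This is purely illustrative: `dummy_predict` stands in for a real BERT-style model and just reveals ground-truth tokens, so only the loop's structure is meaningful here.

```python
import random

random.seed(0)
TARGET = ["the", "cat", "sat", "on", "the", "mat"]

def dummy_predict(tokens):
    # A real model would predict each [MASK] from context; this dummy
    # reveals the ground-truth token so the loop's shape is visible.
    return [t if t != "[MASK]" else TARGET[i] for i, t in enumerate(tokens)]

tokens = ["[MASK]"] * len(TARGET)
while "[MASK]" in tokens:
    preds = dummy_predict(tokens)
    masked = [i for i, t in enumerate(tokens) if t == "[MASK]"]
    # Diffusion-style: un-mask only a few positions per step, not all at once.
    for i in random.sample(masked, k=min(2, len(masked))):
        tokens[i] = preds[i]

print(" ".join(tokens))   # the cat sat on the mat
```

Each pass is one "denoising" step; running the same fill-in model repeatedly over a partially masked sequence is what makes MLM look like a single diffusion step.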

nathan.rs 1 month ago

Research Log

I’m doing my master’s thesis around distributed low-communication training. Essentially, how can we train large models efficiently across distributed nodes and not be utterly destroyed by network latency and bandwidth? Below is some of what I’ve learned and investigated along the way. A desirable problem to solve is being able to use different kinds of hardware for training. Even within the same generation, NVIDIA B300 GPUs are 50% faster than B200s. Companies like Meta operate many clusters that are each internally homogeneous but differ in hardware from one another. It would be ideal to be able to train a model across clusters regardless of the exact underlying hardware used.

nathan.rs 6 months ago

Running GPT-2 in WebGL: Rediscovering Classic GPGPU Programming

A few weeks back, I implemented GPT-2 using WebGL and shaders (Github Repo), which made the front page of Hacker News (discussion). This is a short write-up of what I learned about old-school general-purpose GPU programming over the course of this project. In the early 2000s, NVIDIA introduced programmable shaders with the GeForce 3 (2001) and GeForce FX (2003). Whereas earlier GPUs were limited to predetermined transformations and effects, developers were now given unprecedented control over the rendering pipeline, enabling much more sophisticated visual effects. These programmable shaders laid the foundation for modern GPU computing.

nathan.rs 1 year ago

Mathematical Statistics

My notes over Mark Maxwell’s course, Introduction to Mathematical Statistics, and his textbook, Probability & Statistics with Applications, Second Edition. Normally in a probability experiment, we don’t know the true values of a model’s parameters, and therefore, we must estimate them using random observations. Because the observations are random, our estimates are subject to the vagaries of chance. We find ourselves in a paradoxical situation in which the parameters are fixed, but unknown, while the estimates are random, but observable.
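The fixed-but-unknown parameter versus random-but-observable estimate distinction can be simulated directly; here is a Python sketch with a Bernoulli experiment (the parameter value and sample size are my own toy choices, not from the notes).

```python
import random

random.seed(42)
p_true = 0.3                        # fixed, "unknown" parameter
# 1,000 random observations from a Bernoulli(p_true) experiment.
sample = [1 if random.random() < p_true else 0 for _ in range(1000)]
p_hat = sum(sample) / len(sample)   # random, but observable, estimate
print(p_hat)   # near 0.3, but different for every random sample
```

Rerunning with a different seed gives a different `p_hat` each time, while `p_true` never changes — the paradox in one experiment.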

nathan.rs 1 year ago

Common Probability Distributions

An overview of common discrete and continuous distributions found in probability and statistics, from Mark Maxwell’s textbook, Probability & Statistics with Applications, Second Edition. A random variable $X$ is said to have a discrete uniform distribution if its probability function is: $$Pr(X=x)=\frac{1}{n}$$ for $x=1,2,\dots,n$. Expected Value: $$E[X]=\frac{n+1}{2}$$ Variance: $$Var[X]=\frac{n^2-1}{12}$$ Median: same as the expected value. A Bernoulli trial is an experiment that has two outcomes (true-false, girl-boy, success-fail, in-out, etc.).
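The closed forms for the mean and variance can be sanity-checked by direct enumeration. A Python sketch (the function name `uniform_moments` is mine, not from the book), using exact rational arithmetic:

```python
from fractions import Fraction

def uniform_moments(n):
    """Enumerate the discrete uniform distribution on {1, ..., n}
    and compute its mean and variance exactly."""
    p = Fraction(1, n)   # Pr(X = x) = 1/n for each x
    mean = sum(Fraction(x) * p for x in range(1, n + 1))
    var = sum((Fraction(x) - mean) ** 2 * p for x in range(1, n + 1))
    return mean, var

# Enumeration agrees with E[X] = (n+1)/2 and Var[X] = (n^2-1)/12:
for n in (6, 10, 100):
    mean, var = uniform_moments(n)
    assert mean == Fraction(n + 1, 2)
    assert var == Fraction(n * n - 1, 12)
```

For a fair die ($n=6$), this gives $E[X]=7/2$ and $Var[X]=35/12$.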

nathan.rs 1 year ago

How to Fix Hugo's iOS Code-Block Text-Size Rendering Issue

Lately, I’ve been coming across many blogs that have weird font-size rendering issues for code blocks on iOS. Basically, in a code snippet, the text size would sometimes be much larger for some lines than others. Below is a screenshot of the issue from a website where I’ve seen this occur. As you can see, the text size isn’t uniform across code block lines. I’ve seen this issue across many blogs that compile markdown files to HTML, such as sites built using Hugo, Jekyll, or even custom md-to-html shell scripts.

nathan.rs 2 years ago

Intro to Autograd Engines: Karpathy's Micrograd in Go

For a while, I wanted to build a complete autograd engine. What is an autograd engine, you might ask? To find the answer, we first must know what a neural network is. A neural network can just be seen as a black-box function. We pass an input into this black box and receive an output. Normally, in a function, we define the rules on how to manipulate the input to get an output. For example, if we want a function that doubles the input, i.e., $f(x) = 2x$, then all we would write is:
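The excerpt cuts off at the function it describes; the post itself is in Go, but here is the same idea as a Python sketch, along with a central-difference approximation of the derivative that an autograd engine would compute for us automatically (`numeric_grad` is my own illustrative helper, not from the post):

```python
def f(x):
    """Double the input: f(x) = 2x."""
    return 2 * x

# An autograd engine derives df/dx symbolically by tracking operations;
# a central difference approximates the same quantity numerically
# (for this f, df/dx = 2 everywhere).
def numeric_grad(fn, x, h=1e-6):
    return (fn(x + h) - fn(x - h)) / (2 * h)

print(f(3))                  # 6
print(numeric_grad(f, 3.0))  # ≈ 2.0
```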

nathan.rs 2 years ago

Where Rust Shines: Algebraic Types and Match Statements

Recently I was going through Thorsten Ball’s “Writing An Interpreter in Go”. In this book, you create a basic interpreted language and write a lexer, parser, evaluator, and REPL for it. A lexer takes in source code and turns it into an intermediate representation, usually a stream of tokens; this is called lexical analysis. A parser takes this stream of tokens and turns it into an abstract syntax tree, which is then evaluated and run.
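To make the lexer-to-parser pipeline concrete, here is a minimal lexer sketch in Python (the book's implementation is in Go; the token names and patterns below are my own illustration, not the book's):

```python
import re

# Ordered token patterns: first match at the current position wins.
TOKEN_SPEC = [
    ("NUMBER", r"\d+"),
    ("IDENT",  r"[A-Za-z_]\w*"),
    ("OP",     r"[+\-*/=]"),
    ("SKIP",   r"\s+"),   # whitespace is consumed but not emitted
]

def lex(source):
    """Turn source text into a stream of (kind, text) tokens."""
    tokens = []
    pos = 0
    while pos < len(source):
        for name, pattern in TOKEN_SPEC:
            m = re.match(pattern, source[pos:])
            if m:
                if name != "SKIP":
                    tokens.append((name, m.group()))
                pos += m.end()
                break
        else:
            raise SyntaxError(f"unexpected character {source[pos]!r}")
    return tokens

print(lex("x = 40 + 2"))
# [('IDENT', 'x'), ('OP', '='), ('NUMBER', '40'), ('OP', '+'), ('NUMBER', '2')]
```

A parser would then consume this token stream to build the abstract syntax tree.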

nathan.rs 2 years ago

Favorite Quotes

Here are a few of my favorite quotes I’ve liked over the years. “I believe that a man should strive for only one thing in life, and that is to have a touch of greatness” — Félix Martí-Ibáñez “In Three Words, I Can Sum Up Everything I’ve Learned About Life. It Goes On” — Robert Frost “But you see,” said Roark quietly, “I have, let’s say, sixty years to live. Most of that time will be spent working. I’ve chosen the work I want to do. If I find no joy in it, then I’m only condemning myself to sixty years of torture. And I can find the joy only if I do my work in the best way possible to me. But the best is a matter of standards—and I set my own standards. I inherit nothing. I stand at the end of no tradition. I may, perhaps, stand at the beginning of one.” ― Ayn Rand, The Fountainhead

nathan.rs 2 years ago

Favorite Books

Below are all the books I’ve read since middle school, roughly in order. Those highlighted in blue were those I particularly enjoyed :) Willpower - Roy F. Baumeister & John Tierney Deng Xiaoping and the Transformation of China - Ezra F. Vogel

nathan.rs 2 years ago

Gradient Descent & Optimizers

These are some of my notes over Qiang Liu’s course, Machine Learning II. Gradient Descent is a fundamental, first-order iterative optimization algorithm designed for minimizing a function. The primary objective of Gradient Descent is to find the minimum value of a function by iteratively moving in the direction opposite the gradient, i.e., the direction of steepest descent. Update Rule: The parameters $ \theta $ are updated as follows in each iteration:
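The excerpt is truncated before the rule itself; the standard vanilla update is $\theta \leftarrow \theta - \eta \nabla f(\theta)$, which can be sketched in a few lines of Python (the function and objective below are my own toy example, not from the notes):

```python
def grad_descent(grad, theta, lr=0.1, steps=100):
    """Vanilla gradient descent: theta <- theta - lr * grad(theta)."""
    for _ in range(steps):
        theta = theta - lr * grad(theta)
    return theta

# Minimize f(theta) = (theta - 3)^2, whose gradient is 2 * (theta - 3).
theta_star = grad_descent(lambda t: 2 * (t - 3), theta=0.0)
print(theta_star)   # converges to ~3.0, the minimizer of f
```

Each step shrinks the distance to the minimizer by a constant factor here (the map is $\theta \mapsto 0.8\,\theta + 0.6$), which is why 100 steps land essentially on 3.0.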

nathan.rs 2 years ago

Language Modeling: Word Embeddings & Architectures

These are a few of my notes from Eunsol Choi’s NLP class at UT Austin. Word embeddings are a type of word representation that captures the semantic meaning of words in a continuous vector space. Unlike one-hot encoding, where each word is represented as a binary vector of all zeros except for a single ‘1’, word embeddings capture much richer information, including semantic relationships, word context, and even aspects of syntax.
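The contrast with one-hot encoding can be shown numerically: one-hot vectors are all mutually orthogonal, while embeddings place related words near each other. A Python sketch with cosine similarity (the 3-d embedding values below are made up for illustration):

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors given as lists."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Toy embeddings: related words get nearby vectors.
emb = {
    "king":   [0.90, 0.80, 0.10],
    "queen":  [0.85, 0.82, 0.15],
    "banana": [0.10, 0.05, 0.90],
}
assert cosine(emb["king"], emb["queen"]) > cosine(emb["king"], emb["banana"])

# One-hot vectors carry no such signal: every distinct pair is orthogonal.
assert cosine([1, 0, 0], [0, 1, 0]) == 0.0
```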

nathan.rs 2 years ago

Neural Networks: RNNs, Seq2Seq, & CNNs

These are a few of my notes from Eunsol Choi’s NLP class at UT Austin. Recurrent Neural Networks (RNNs) are a class of artificial neural networks specifically designed to tackle sequence-based problems. Unlike traditional feedforward neural networks, RNNs possess a memory in the form of a hidden state, enabling them to remember and leverage past information when making decisions. This makes them particularly effective for tasks like language modeling, time-series forecasting, and sentiment analysis.
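The "memory in the form of a hidden state" can be shown with a scalar toy cell: each step mixes the new input with the previous hidden state, so earlier inputs influence later outputs. The weights below are arbitrary illustration values, not from the notes:

```python
import math

def rnn_step(x, h, w_x=0.5, w_h=0.8, b=0.0):
    """One step of a vanilla RNN cell: h' = tanh(w_x * x + w_h * h + b)."""
    return math.tanh(w_x * x + w_h * h + b)

h = 0.0                          # initial hidden state
for x in [1.0, 0.5, -1.0]:       # a length-3 input sequence
    h = rnn_step(x, h)           # h carries past inputs forward
print(h)
```

Feeding the same inputs in a different order yields a different final `h`, which is exactly what a feedforward network applied to each input independently cannot capture.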

nathan.rs 2 years ago

Classifiers: Generative & Discriminative Models

These are a few of my notes from Eunsol Choi’s NLP class at UT Austin. When it comes to classification, models are broadly categorized into Generative Models and Discriminative Models. In generative models, we aim to model the joint distribution of the data $ p(x, y) $. These models often assume a particular functional form for both $ P(x|y) $ and $ P(y) $. To classify a new data point, we maximize:
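Classification with a generative model means picking the class $y$ that maximizes $P(x|y)\,P(y)$. A toy Python sketch (the probability tables are made-up illustration values, not from the notes):

```python
prior = {"spam": 0.4, "ham": 0.6}           # P(y)
likelihood = {                              # P(x | y)
    ("free money", "spam"): 0.30, ("free money", "ham"): 0.01,
    ("see you soon", "spam"): 0.05, ("see you soon", "ham"): 0.20,
}

def classify(x):
    """Return argmax_y P(x|y) * P(y)."""
    return max(prior, key=lambda y: likelihood[(x, y)] * prior[y])

print(classify("free money"))     # spam  (0.30 * 0.4 > 0.01 * 0.6)
print(classify("see you soon"))   # ham   (0.20 * 0.6 > 0.05 * 0.4)
```

A discriminative model would instead learn $P(y|x)$ directly, skipping the joint distribution entirely.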

nathan.rs 2 years ago

Probability

My notes over Mark Maxwell’s course, Probability I, and his textbook, Probability & Statistics with Applications, Second Edition. The fundamental theorem of counting is also known as the multiplication principle. If a first task has $N(A)$ outcomes and, for each of these outcomes, a second task has $N(B)$ outcomes, then the total number of outcomes for the two combined is $N(A)\cdot N(B)$.
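The multiplication principle can be verified by brute-force enumeration; a Python sketch (the shirts-and-pants example is my own, not from the book):

```python
from itertools import product

# N(A) = 3 choices of shirt; for each, N(B) = 2 choices of pants.
shirts = ["red", "blue", "green"]
pants = ["jeans", "khakis"]

# Enumerate every combined outcome and count them.
outfits = list(product(shirts, pants))
print(len(outfits))   # 6 == N(A) * N(B) == 3 * 2
```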

nathan.rs 2 years ago

Linear Algebra

These are my notes over my review of Linear Algebra, going through Gilbert Strang’s Introduction To Linear Algebra. The core of linear algebra is vector addition and scalar multiplication. Combining these two operations gives us a set of linear combinations. $$ c\mathbf{v} + d\mathbf{w} = c\begin{bmatrix} 1 \\ 2 \end{bmatrix} + d\begin{bmatrix} 3 \\ 4 \end{bmatrix} = \begin{bmatrix} c + 3d \\ 2c + 4d \end{bmatrix}. $$
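The linear combination above can be computed componentwise; a small Python sketch (plain lists rather than a linear algebra library, and `linear_combination` is my own name):

```python
def linear_combination(c, v, d, w):
    """Return c*v + d*w, combining scalar multiplication and vector addition."""
    return [c * vi + d * wi for vi, wi in zip(v, w)]

v, w = [1, 2], [3, 4]
# With c = 2, d = 1 the formula [c + 3d, 2c + 4d] gives [5, 8]:
print(linear_combination(2, v, 1, w))   # [5, 8]
```

Varying $c$ and $d$ over all real values sweeps out every linear combination of $\mathbf{v}$ and $\mathbf{w}$ — here, the whole plane, since the two vectors are independent.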

nathan.rs 2 years ago

Rust Front-End Development with Dioxus

October 14th, 2025: This post is old and is most likely outdated if you’re reading this! Dioxus has possibly changed a substantial amount, so don’t read this as a how-to guide. I’ve been using React and Next.js for front-end development ever since high school; it was one of the first few things I learned when it came to programming. Recently, I’ve had the itch to learn something new, specifically Rust front-end. As someone with a “.rs” domain, it felt like an inevitable fate. Finally, I can say I put the “.rs” in the “nathan.rs”.

nathan.rs 2 years ago

Basic Calculus

A small review over Calculus 1, 2, and 3, based on the textbook, Calculus: Early Transcendentals (Eighth Edition). If $f$ and $g$ are both differentiable, then $$\frac{d}{dx}[f(x)g(x)]=f(x)g^\prime(x)+g(x)f^\prime(x)$$ If $f$ and $g$ are differentiable, then $$\frac{d}{dx}\bigg[\frac{f(x)}{g(x)}\bigg]=\frac{g(x)f^\prime(x)-f(x)g^\prime(x)}{[g(x)]^2}$$ If an integrand contains both a function of $x$ and that function’s derivative, you can use u-substitution. $$\int x x^\prime dx = \int u du$$
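The product rule can be checked numerically against a central-difference derivative; a Python sketch (the choice of $f=\sin$, $g=e^x$ and the helper `numeric_deriv` are my own, not from the textbook):

```python
import math

def numeric_deriv(fn, x, h=1e-6):
    """Central-difference approximation of fn'(x)."""
    return (fn(x + h) - fn(x - h)) / (2 * h)

f, fp = math.sin, math.cos   # f and its derivative f'
g, gp = math.exp, math.exp   # g and its derivative g'

x = 1.3
lhs = numeric_deriv(lambda t: f(t) * g(t), x)   # d/dx [f(x) g(x)], numerically
rhs = f(x) * gp(x) + g(x) * fp(x)               # f g' + g f', from the rule
print(abs(lhs - rhs))   # tiny: the two sides agree
```

The same check works for the quotient rule by swapping in `lambda t: f(t) / g(t)` and the corresponding right-hand side.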

nathan.rs 4 years ago

This Mountain We Climb

This is a poem that I wrote my senior year of high school in AP Literature. Here we all are, this mountain we climb, the sure ascent, that lasts a lifetime, at the golden summit, a goal we all seek the meaning of life, at its Godly peak. Up we should go, a noble direction. Yet why do so many, rebel in rejection Up is worthwhile, this mountain we climb, at the apex is all that’s sublime.
