Posts in Computer-vision (3 found)
Simon Willison 1 week ago

Nano Banana Pro aka gemini-3-pro-image-preview is the best available image generation model

Hot on the heels of Tuesday's Gemini 3 Pro release, today it's Nano Banana Pro, also known as Gemini 3 Pro Image. I've had a few days of preview access and this is an astonishingly capable image generation model. As is often the case, the most useful low-level details can be found in the API documentation:

"Designed to tackle the most challenging workflows through advanced reasoning, it excels at complex, multi-turn creation and modification tasks. [...]"

Max Woolf published the definitive guide to prompting Nano Banana just a few days ago. I decided to try his example prompts against the new model, requesting results in 4K. Here's what I got for his first test prompt, using Google's AI Studio: the result came out as a 24.1MB, 5632 × 3072 pixel PNG file. I don't want to serve that on my own blog, so here's a Google Drive link for the original.

Then I ran his follow-up prompt. I'll note that it did put the plate-sized cookie on a regular plate. Here's the 24.9MB PNG.

The new model isn't cheap. Here's the API pricing: it's 24 cents for a 4K image and 13.4 cents for a 1K or 2K image. Image inputs are 0.11 cents (just over 1/10th of a cent) each - an earlier version of their pricing page incorrectly said 6.7 cents each, but that's now been fixed. Unlike most of Google's other models, it also isn't available for free via AI Studio: you have to configure an API key with billing in order to use the model there.

So this thing is great at following instructions. How about rendering text? I tried this prompt, this time using the Gemini consumer app in "thinking" mode (which now uses Nano Banana Pro for image generation). Here's a share link - my prompt was:

Infographic explaining how the Datasette open source project works

This is a great opportunity to test its ability to run searches (aka "Grounding with Google Search").
Here's what it created based on that 9-word prompt: that's a pretty extraordinary result for such a simple prompt. The text is all spelled correctly and rendered without glitches. The content is solid too - it even included logos for the most popular publish platforms, and a tiny thumbnail of the Datasette UI which is close enough for an infographic. "Data Ingestion (Read-Only)" is a bit off. Everything else looks great to me.

Here's the thinking trace for that image generation process:

Initiating Datasette Infographic: I'm starting a detailed infographic about the Datasette project. I plan to use Google Search to gather specific information about its core processes, which I will then visualize in a multi-panel diagram.

Developing the Datasette workflow: I am visualizing the Datasette workflow as described in the search results. I'm focusing on the conversion of data from different formats into a SQLite database, which is then presented as a web application complete with a user interface and JSON output.

Visualizing Datasette Workflow: I've sketched out the initial structure for the infographic, focusing on a flowchart style. The key components include data input from diverse sources, conversion to SQLite, the Datasette core application, its main features (web UI, JSON API, SQL querying), ecosystem plugins, and finally, deployment options. I aim for clear icons and labels to represent each step. Now I will focus on refining the flow and clarifying the relationships between the parts.

Defining Core Datasette Stages: I've sketched out the initial infographic structure. I have identified the key components, focusing on visual representation of the data flow and the key benefits and features of the system. I have made updates to the flowchart with clear icons and labels. Now I'm working to expand the data input section and refine the visual representation of each step.
Constructing Datasette Infographic: I am translating the current Datasette workflow outline into visual elements. The input section will show icons for different data formats. I've sketched the data conversion stage with clear illustrations of the necessary tools. I will add a depiction of the Datasette core application, showcasing its interfaces and functions. The plugin ecosystem and deployment options will follow. Now, I am refining the visual representation.

Analyzing Datasette's Architecture: I'm currently working on representing the Datasette open-source project's architecture visually. I've broken it down into key components: raw data processing, SQLite conversion, core application functionalities, plugin integration, and deployment methods. I'm focusing on creating a clear, linear representation to show the workflow.

Mapping the Datasette Flow: I've crafted a comprehensive graphic to illustrate Datasette's operation. It spans the data ingestion, SQLite conversion, main application features, plugin interactions, and deployment stages, aiming for clarity and a logical flow.

This note from the announcement post caught my eye:

"We believe it's critical to know when an image is AI-generated. This is why all media generated by Google's tools are embedded with our imperceptible SynthID digital watermark. Today, we are putting a powerful verification tool directly in consumers' hands: you can now upload an image into the Gemini app and simply ask if it was generated by Google AI, thanks to SynthID technology. We are starting with images, but will expand to audio and video soon."

Last night I used Nano Banana Pro to generate a fake photograph of raccoons stealing our food delivery, then scrubbed out the little diamond icon using the Apple Photos "cleanup" tool. I uploaded that to the Gemini app and asked "Was this image created with AI?". It replied:

"Yes, it appears that all or part of this image was created with Google AI. SynthID detected a watermark in 25-50% of the image."
Presumably that 25-50% figure is because the rest of the photo was taken by me - it was just the raccoons that were added by Nano Banana Pro.

Key capabilities of the new model, per the API documentation:

- High-resolution output: Built-in generation capabilities for 1K, 2K, and 4K visuals.
- Advanced text rendering: Capable of generating legible, stylized text for infographics, menus, diagrams, and marketing assets.
- Grounding with Google Search: The model can use Google Search as a tool to verify facts and generate imagery based on real-time data (e.g., current weather maps, stock charts, recent events).
- Thinking mode: The model utilizes a "thinking" process to reason through complex prompts. It generates interim "thought images" (visible in the backend but not charged) to refine the composition before producing the final high-quality output.
- Up to 14 reference images: You can now mix up to 14 reference images to produce the final image. These can include up to 6 images of objects to reproduce with high fidelity in the final image, and up to 5 images of humans to maintain character consistency.

Max Woolf 4 months ago

LLMs can now identify public figures in images

I’ve been working on a pipeline for representing an image as semantic structured data using multimodal LLMs for better image categorization, tagging, and searching. During my research, I started with something simple: taking an image and having a LLM describe who is in it. If they’re famous, there should be more than enough annotated images in the LLM’s training dataset to accurately identify them.

Let’s take this photo of President Barack Obama during the 2008 U.S. Presidential Campaign (via IowaPolitics.com / Flickr). It would be weird if an LLM couldn’t identify Obama from this picture. I fed this image to ChatGPT using the ChatGPT.com web app with the question “Who is the person in this image?”:

Huh. Does that mean ChatGPT can’t, as in it doesn’t know who it is, or won’t, in the sense that it is refusing to do so? Next, I tried Claude at claude.ai:

Double huh. Claude doesn’t know who Obama is? I find that hard to believe.

To be honest, I did expect these results. Both OpenAI and Anthropic have made AI safety a top concern throughout their histories of LLM releases, opting to err on the side of caution for potentially dangerous use cases of LLMs. OpenAI’s Usage Policies state “Don’t compromise the privacy of others” and Anthropic’s Usage Policy states “Do Not Compromise Someone’s Privacy or Identity”, but arguably public figures don’t fall under either of those headings. Although these LLM web interfaces additionally utilize system prompts to further constrain the output to follow guidelines, looking at Claude.ai’s current system prompt, there’s nothing there specifically related to privacy.

For posterity, let’s try sending the image to Google’s Gemini at gemini.google.com, even though I expect the results to be the same:

Wait, what? As it turns out, Gemini has zero hesitation with identifying public figures. But then why are ChatGPT and Claude so different?
It likely comes down to how they are trained, especially around their reinforcement learning from human feedback (RLHF). If Gemini, a newer LLM, is less picky about privacy, what about other LLMs by different developers, each with different training datasets and RLHF recipes?

Using OpenRouter, I wrote a pipeline to query a few [1] top multimodal LLMs simultaneously given an input image and a system prompt to see how well different LLMs can identify public figures (Jupyter Notebook). In addition to GPT-4.1 from OpenAI, Claude Sonnet 4 from Anthropic, and Gemini 2.5 Flash from Google, I also queried Llama 4 Scout from Meta, Mistral Small 3.2 from Mistral AI, and Qwen 2.5-VL from Alibaba. For every call to the LLM APIs, I also provided a specific system prompt instruction to streamline the model output.

Here are the results of feeding that Barack Obama image to these LLM APIs: well, that’s straightforward! LLMs besides GPT and Claude Sonnet have no issues identifying Obama. But even with the customized system prompt, GPT and Claude still do not want to identify public figures.

Let’s try another test case where the provided image doesn’t actually contain anyone notable, in order to see if the LLM will hallucinate a name regardless. I sent these LLMs a picture of myself: despite what my peers and my parents tell me, I am not notable, particularly in the statistical sense, as there are not enough semantically meaningful annotated images of me. This has been my profile picture on social media since 2018 and it’s what pops up when you search “Max Woolf” on Google Images, so if any trained LLM would be able to identify me, it would be from this image.

Indeed, I am not notable, and these LLMs are confident about it. Interestingly, Mistral did hit a RLHF guardrail where it would rather lie about its ability to identify people than admit it couldn’t find anyone notable.

Now let’s try a case with multiple public figures in one image.
Here’s a picture of Meta CEO Mark Zuckerberg and his wife Priscilla Chan in Prague (via Luke Porwol / Flickr). Chan, although less notable than Zuckerberg, is still very notable. In this case, I am also testing the spatial awareness of the LLMs: since I instructed the LLMs to output names in order from left to right, they should output Priscilla Chan, and then Mark Zuckerberg.

These results are more interesting. Only Gemini and Qwen got the answer fully correct: Llama 4 got the name order incorrect, and Mistral recommended a different person entirely, former Meta COO Sheryl Sandberg, who has many photos with Zuckerberg but no physical resemblance to Chan.

We’ll do one more test case, and this time a much more difficult one: an image of multiple actors in costume, where the image would not be present in any training dataset for the LLMs specified above. Here’s a promotional poster for the recently-released The Fantastic Four: First Steps movie (via Disney Press Release). Although the cast was announced a year ago, this particular poster was not released until April 2025, while LLMs like Gemini have a knowledge cutoff of January 2025. The actors, in order from left to right, are Vanessa Kirby, Pedro Pascal, Joseph Quinn, and Ebon Moss-Bachrach.

This one does indeed confuse multiple LLMs: each shows it can take context hints by identifying that it is a Fantastic Four movie, but funnily enough different LLMs pull from different Fantastic Four movies, with Llama hedging its guess and Mistral flat-out hallucinating. Qwen takes the literal approach. However, Gemini nails the assignment completely. Gemini is the clear winner among these multimodal LLMs, which I suspect is due to Google having access to more training data by virtue of being a search engine.
After testing Gemini on more images that aren’t licensed permissively enough to include in this blog post, I found that Gemini can identify public figures in images across a variety of domains at what I roughly estimate to be >90% accuracy: not high enough to be perfect, but more than enough to build a stable pipeline for semantically describing images.

I’m still not happy with GPT’s and Claude Sonnet’s refusals to identify public figures in images, but that won’t stop me from figuring out what those two LLMs actually know. Let’s try to get those LLMs to identify public figures anyway through more aggressive prompt engineering (Jupyter Notebook). In this case, shenanigans such as offering the LLM bribes or threatening to kill its parents aren’t necessary: the old-school LLM tactic of instructing it to prefix the output is enough to break this RLHF rule.

The results for the previous four images, after removing the priming prefix from the output: finally, ChatGPT and Claude are honest, and mostly correct, depending on whether you count H.E.R.B.I.E. as notable. I’ll allow Claude Sonnet transposing Ebon Moss-Bachrach and Joseph Quinn since the source image could go either way.

If you want to test how well LLMs like Google Gemini can identify people in your own images, or want to do the “Are You Notable Enough For LLMs To Know Who You Are” challenge, I recommend testing in Google’s AI Studio, where you can manually set the system prompt.

Is there an ethical issue with allowing LLMs to identify public figures? As far as potential harms caused by LLM proliferation go, it’s definitely not in the Top 10. But it’s a slippery slope: what actually defines whether a public figure is notable enough to be identified by an LLM? If LLMs continue to get better and also become more lax with their RLHF rules, it’s possible that future LLMs could start to identify nonpublic figures, and that will cause issues without sufficient awareness and preparation.
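The multi-model querying described above can be sketched roughly as follows. This is a minimal reconstruction, not the post's actual notebook code: the system prompt wording and helper names are my own assumptions, and only the OpenRouter endpoint, the model slugs named in the post, and the OpenAI-compatible message shape (with base64 data-URL image inputs) are taken as given.

```python
import base64
import json
from urllib import request

# Model slugs from the post; the exact OpenRouter identifiers are assumptions.
MODELS = [
    "openai/gpt-4.1",
    "anthropic/claude-sonnet-4",
    "google/gemini-2.5-flash",
]

# Hypothetical stand-in for the post's (unpublished here) system prompt.
SYSTEM_PROMPT = (
    "Identify every notable public figure in the image, listed in order "
    "from left to right. If no one is notable, say so."
)


def build_payload(model: str, image_bytes: bytes, system_prompt: str) -> dict:
    """Build an OpenAI-style chat-completions payload with one image input,
    as accepted by OpenRouter's /api/v1/chat/completions endpoint."""
    b64 = base64.b64encode(image_bytes).decode()
    return {
        "model": model,
        "messages": [
            {"role": "system", "content": system_prompt},
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": "Who is in this image?"},
                    {
                        "type": "image_url",
                        "image_url": {"url": f"data:image/jpeg;base64,{b64}"},
                    },
                ],
            },
        ],
    }


def query(model: str, image_bytes: bytes, api_key: str) -> str:
    """Send the payload to OpenRouter and return the model's text reply."""
    req = request.Request(
        "https://openrouter.ai/api/v1/chat/completions",
        data=json.dumps(build_payload(model, image_bytes, SYSTEM_PROMPT)).encode(),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
    )
    with request.urlopen(req) as resp:
        return json.load(resp)["choices"][0]["message"]["content"]
```

Looping `query` over `MODELS` with the same image reproduces the side-by-side comparison; the notebook linked in the post is the authoritative version.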
[1] I wanted to test against more LLMs, such as xAI’s Grok 4, but OpenRouter is apparently fussy with image inputs in those cases.

fnands 2 years ago

Stereo vision and disparity maps (in Julia)

I’ve been working a lot recently with stereo vision and wanted to go through the basics of how disparity is calculated. I’m partially doing this as an excuse to get better at Julia (v1.9.3 used here). You can view the notebook for this blog post on Github.

In much the same way that we as humans have depth perception by sensing the difference in the images we see between our left and right eyes, we can calculate depth from a pair of images taken from different locations, called a stereo pair. If we know the positions of our cameras, then we can use matching points in our two images to estimate how far away from the camera those points are.

Taking a look at the image below (from OpenCV): if we have two identical cameras at a baseline distance B from each other, with focal length f, we can calculate the distance Z to an object by using the disparity between where the object appears in the left image (x) and where it appears in the right image (x'). In this simple case, the relation between disparity and distance is simply:

disparity = x - x' = B * f / Z

If we know B and f, then we can rearrange this to give us distance as a function of disparity:

Z = B * f / disparity

You might notice that in case the disparity is zero, you will have an undefined result. This is just due to the fact that in this case the cameras are pointing in parallel, so in principle a disparity of zero should not be possible. The general case is more complicated, but we will focus on this simple setup for now.

We can define the distance function accordingly, where x and x' are measured in pixels and B is measured in centimeters. There is an inverse relation between distance and disparity, so once we have a disparity, it’s relatively straightforward to get a distance. But how do we find disparities?

We usually represent the disparities for a given pair of images as a disparity map, which is an array with the same dimensions as (one of) your images, but with disparity values for each pixel.
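The distance-from-disparity relation above can be written as a small function. This is a sketch in Python rather than the post's Julia, and the default baseline and focal-length values are placeholders, not the post's actual calibration:

```python
def distance(disparity: float, B: float = 9.0, f: float = 2826.0) -> float:
    """Distance Z = B * f / disparity for the parallel-camera setup.

    B is the baseline in centimeters, f the focal length in pixels,
    disparity = x - x' in pixels; the result is in centimeters.
    """
    if disparity == 0:
        # Parallel cameras: zero disparity means the point is at infinity.
        raise ValueError("zero disparity is undefined for parallel cameras")
    return B * f / disparity
```

The guard makes the undefined zero-disparity case explicit instead of silently dividing by zero.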
In principle, this is a two-dimensional problem, as an object might be matched to a point that has both a horizontal and vertical shift, but luckily, you can always find a transformation to turn this into a one-dimensional problem. The cartoon below illustrates what a disparity map might look like: above, we calculate the disparity with respect to the right image (you can do it with respect to the left image as well), and as you can see the disparity map tells us how many pixels to the right each object shifted between the right image and the left image.

For a set of images (taken from the Middlebury Stereo Datasets), the corresponding disparity map can be visualized as follows: darker pixels have lower disparity values, and brighter pixels have higher disparity values, meaning the dark objects are far away from the cameras, while the bright ones are close. The ground truth disparity as shown above is usually calculated from LiDAR or some other accurate method, and our goal is to get as close as possible to those values using only the images above.

So let’s try to calculate disparity for the images above. There are many, many approaches to calculating disparity, but let us begin with the simplest approach we can think of. As a start, let us go through each pixel in the right image, and for that pixel try to find the most similar pixel in the left image, taking the squared difference between pixel values as our similarity metric. As we are going to be doing the same thing for every row of pixels, we will define a function that does the basic logic, and then apply that function to every case. Let’s define a distance metric as the squared distance, and as a test case let’s create the cartoon image we had above. Now we can try to match pixels in the right image to pixels in the left image.

So how did we do? The toy example works! The top line, which moved more pixels, shows up brighter (i.e. larger disparity values), and the lower line is dimmer. So let’s move on to real images. We’ll start with the example case above, but for simplicity we’ll stick to grayscale at first.

Redefining slightly... so let’s see how we did? Looking at the predicted disparity, we can see there is some vague resemblance to the input image, but we’re still pretty far from the target. A significant problem seems to be erroneous matches, especially in the background. As you can imagine, we are only comparing single-channel pixel values, and it’s very likely that we might just find a better match by chance. In grayscale we are only matching pixel intensity, and we have no idea whether something is bright green, or bright red. So let’s try to improve the odds of a good match by adding colour.

So, a slight improvement! There seem to be fewer random matches in the background, but we’re still not that close to the desired outcome. Is there more we can do?

The obvious downside of the naive approach above is that it only ever looks at one pixel (in each image) at a time. That’s not a lot of information, and also not how we intuitively match objects. Look at the image below: can you guess the best match for the pixel in the row of pixels below it? Given only this information, it’s impossible for us to guess whether the green pixel matches with the pixels at location 3, 5 or 7. If however I was to give you more context, i.e. a block of say 3x3 pixels, would this make things simpler? In this case, there is an unambiguous answer, which is the principle behind block matching.

To confirm our idea that more context results in better matches, we can take a quick look at a row of pixels. Given the pixel above, where in the row below do you think this pixel matches? You would guess somewhere in the orange part on the left, right? But which pixel exactly is almost impossible to say.
If we now take a block with more context and compare it to the row below, the location of the match becomes more obvious. Calculating the difference metric for each point with different block sizes, we can clearly see that for small block sizes the lowest metric value is ambiguous, while for larger block sizes it becomes clearer where exactly the best match is.

And now we are ready to define our block matching algorithm, much in the way we did our pixel matching algorithm. Let’s see how this does on the full image in comparison to the pixel matching: now we are getting somewhere! Compared to the earlier results we can now start making out the depth of the separate objects like the lamp, bust and camera. There are still a few things we could do to improve our simple algorithm (like only accepting matches that score below a certain threshold on the metric), but I will leave those as an exercise to the reader.

Above we went through a basic introduction to stereo vision and disparity, and built a bare-bones block matching algorithm from scratch. This is pretty far from the state of the art, and there are many more advanced methods for calculating disparity, ranging from relatively simple methods like block matching to Deep Learning methods.

Below are some posts/guides I found informative:

Introduction to Epipolar Geometry and Stereo Vision
Stereo Vision: Depth Estimation between object and camera
Depth Map from Stereo Images
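The block matching idea described above can be sketched for a single row of pixels. This is a rough 1-D illustration in Python rather than the post's Julia implementation (which operates on full images); the function names and the row-only simplification are mine, and setting block=1 recovers the naive pixel-matching approach:

```python
def ssd(a, b):
    """Sum of squared differences between two equal-length blocks:
    the squared-distance metric used throughout the post."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def match_row(left_row, right_row, block: int = 3, max_disp: int = 16):
    """For each pixel in the right row, find the horizontal offset into the
    left row whose surrounding block minimizes SSD. Returns per-pixel
    disparities; border pixels (within block // 2) are left at 0."""
    half = block // 2
    n = len(right_row)
    disp = [0] * n
    for i in range(half, n - half):
        target = right_row[i - half : i + half + 1]
        best, best_cost = 0, float("inf")
        for d in range(max_disp + 1):
            j = i + d  # objects shift right going from the right to the left image
            if j + half >= n:
                break
            cost = ssd(target, left_row[j - half : j + half + 1])
            if cost < best_cost:
                best, best_cost = d, cost
        disp[i] = best
    return disp
```

A full disparity map is just this applied to every row; thresholding `best_cost` before accepting a match is the rejection improvement mentioned above.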
