Leland McInnes
A Mathematician dabbling in Data Science, especially unsupervised learning and data exploration. UMAP, HDBSCAN, PyNNDescent, DataMapPlot. (He/Him)
- Reposted by Leland McInnes: Our new pre-print shows how unsupervised clustering methods can identify biologically meaningful differences in early vocal production, with no human feedback. @antorrisi.bsky.social has led this interdisciplinary collaboration based on computational methods + #chicks 🐣 arxiv.org/abs/2601.12203
- Reposted by Leland McInnes: here's a fun side project i've been working on: i compiled a joint text<>audio embedding model to a fast coreml pipeline, and built a very fast (~400ms for 50k samples, can scale to millions) UMAP dimensionality reduction GPU impl in mlx. using it to browse music libraries and do sample sim search
- Reposted by Leland McInnes: Xiaobin Li, Run Zhang: Understanding and Improving UMAP with Geometric and Topological Priors: The JORC-UMAP Algorithm arxiv.org/abs/2601.16552 arxiv.org/pdf/2601.16552 arxiv.org/html/2601.16552
- Reposted by Leland McInnes: I miss the days where you'd see blogposts with clever analyses on datasets, maths and data science tricks. That's why, as an experiment, we're starting a new moderated subreddit. People can share/promote their notebooks and you can use RSS to subscribe. Please join and share!
- Reposted by Leland McInnes: UMAP connectivity plots of 3,627 chess openings from the @lichess.org datasets (huggingface.co/datasets/Lic...)
- Reposted by Leland McInnes: I think it's important to note though that in spite of those incentives, the direction of the last two years has been more fungibility, *not* lock-in. And open source is the wrong fight here: when lock-in comes it will look more like the lock-in that Amazon or Uber have than Microsoft Office…
- Reposted by Leland McInnes: New preprint! Have you ever wondered, what are these fuzzy simplicial sets, the theoretical framework behind e.g. UMAP? Here we show that you may simply see them as marginal distributions over simplicial sets. This provides a generative model for UMAP. (1/2) arxiv.org/abs/2512.03899
- Reposted by Leland McInnes: Space DJ turns genre embeddings into a playable galaxy: pilot a ship, the music follows. 🚀 Key stats: 768→128 PCA compression; 3D UMAP projection; three.js rendering; autopilot drift; high-dim neighbors surfacing hidden similarities.
- Reposted by Leland McInnes: via the magic of laion_clap embeddings and umap, my live coding thingy has a sample browser at last!
- Reposted by Leland McInnes: I made this annotated scatter plot of 1 million FineWeb-Edu documents for @sashamtl.bsky.social's new TED talk.
- Reposted by Leland McInnes: Also really love how organic the plot looks with "inferno" (left) and "viridis" (right).
- Reposted by Leland McInnes: Map of the internet: 1.3M nodes (BGP)
- The video of my talk at SciPy on DataMapPlot is up at last. If you make t-SNE or UMAP plots the talk provides some guidance on how to make plots most effective, and introduces a library to help make that easier. www.youtube.com/watch?v=-iBh...
- Reposted by Leland McInnes: Despite the gutting of the National Center for Educational Statistics, the dept of Ed *did* manage to release 2024 college major counts in the usual format, so I can run it through the same code I do every year. First off, the change since peak of the largest fields -- another year of drops.
- Reposted by Leland McInnes: I'm very much a learner, but you're maybe asking if aspects of matrix factorisation approaches to dimensionality reduction apply here. But LocalMAP is a KNN approach, with a matrix factorisation initialisation. h/t @lelandmcinnes.bsky.social for his attempts to describe these youtu.be/9iol3Lk6kyU
- Reposted by Leland McInnes: 📢 Save the date! Join us for the next @ellis.eu x UniReps Speaker Series! 📅 27th August – 16:00 CEST 📍 https://ethz.zoom.us/j/66426188160 🎙️ Speakers: Keynote by @lelandmcinnes.bsky.social & Flash Talk by Yu (Demi) Qin 🔔 Stay updated by joining our Google group: groups.google.com/u/2/g/ellis-...
- Reposted by Leland McInnes: 🚀 We've just open-sourced Embedding Atlas – a tool for exploring large embedding spaces through rich, interactive visualizations 📊.
- Reposted by Leland McInnes: Meteoroid stream identification with HDBSCAN unsupervised clustering algorithm. Eloy Peña-Asensio et al. https://arxiv.org/abs/2507.01501
- Reposted by Leland McInnes: Ever wanted to pan through the latent🌌 space of TikTok videos? Made using the amazing toponymy and datamapplot from @lelandmcinnes.bsky.social and data from my and @jurgenpfeffer.bsky.social's first complete TikTok slice. link below
- Reposted by Leland McInnes: 🎤 Speaker Spotlight: Leland McInnes. Join Leland at #SciPy2025 for his talk "DataMapPlot: Rich Tools for UMAP Visualizations." 📊 Discover powerful new ways to explore high-dimensional data! 🔗 scipy2025.scipy.org
- I'll be giving a talk about DataMapPlot for visualizing data maps at SciPy this year. I would love to meet potential users and chat about where to go next. cfp.scipy.org/scipy2025/ta...
- Reposted by Leland McInnes: OMG I am so glad someone finally did this. Thank you 🙏 @lelandmcinnes.bsky.social This will now consume hours and hours of my time. lmcinnes.github.io/datamapplot_...
- Explore Wikipedia through a data map. Pages are grouped by semantic similarity, for topic clusters. Hover to see details, zoom to explore more fine-grained topics, click to go to a page. Search by page name to find interesting starting points for exploration. lmcinnes.github.io/datamapplot_...
- I also updated the ArXiv data map example to make use of new features in datamapplot. lmcinnes.github.io/datamapplot_... You can tweak parameters and build your own version: gist.github.com/lmcinnes/e11...
- Reposted by Leland McInnes: Great idea. Did no one think of this before?
- Explore Wikipedia through a data map. Pages are grouped by semantic similarity, for topic clusters. Hover to see details, zoom to explore more fine-grained topics, click to go to a page. Search by page name to find interesting starting points for exploration. lmcinnes.github.io/datamapplot_...
- All of this is really just a tech-demo for the tools backing it: Toponymy for creating topics and topic labels, and DataMapPlot for creating the interactive visualization. github.com/TutteInstitu... github.com/TutteInstitu...
- It does provide a novel way to explore Wikipedia though. You can see the scope of all of English language Wikipedia at once. There are surprising clusters (Every Polish village; Japanese railway stations; etc.), dense topics, and surprising connections to be found.
- For even more Wikipedia vectors, Nomic.ai just released vectorization and a data map for all of Wikipedia in all languages! enterprise.wikimedia.com/blog/nomic-a... huggingface.co/datasets/nom...
- Reposted by Leland McInnes: 🔥 Meet our Keynote Speakers for #SciPy2025! Dr Malvika Sharan, co-Director of Open Life Science (OLS) and a senior researcher at The Alan Turing Institute will be sharing with us her expertise at our favorite conference. You can't miss her ➡️ hubs.la/Q03sdlsb0
- Reposted by Leland McInnes: 🔥 Meet our Keynote Speakers for #SciPy2025! Hon. Dr. Kathryn D. Huff 🇺🇸, nuclear engineer, policy leader, and former Assistant Secretary for the Office of Nuclear Energy will be joining us in Tacoma! 🙌 Don't miss her talk, grab your ticket now: hubs.la/Q03sdlsb0
- Reposted by Leland McInnes: Nature Reviews Methods Primers: Uniform manifold approximation and projection (UMAP) www.nature.com/articles/s43... 🧬🖥️🧪 read free: rdcu.be/d0YZT
- Reposted by Leland McInnes: Our latest paper is out: peerj.com/articles/cs-.... We added functionality to the #HDBSCAN clustering algorithm to also detect branches hdbscan.readthedocs.io/en/latest/ho... #eda #datavis #clustering
- Reposted by Leland McInnes: Original paper: www.nature.com/articles/s41...
- Reposted by Leland McInnes: No microscope? No problem! A new spatial transcriptomics method developed at the Broad lets you map gene expression with zero imaging needed. Perfect for big tissue samples and small labs! 🔬💡 #spatialtranscriptomics #bioinformatics #transcriptomics #genomics #biology #harvard #broadinstitute
- Reposted by Leland McInnes: If you want to annotate quickly ... embeddings/UMAP/live interaction can be your best friends! Just embed everything in two dimensions and inspect the clusters, annotating as you go along.
- Reposted by Leland McInnes: Sure there's more than a couple dozen… But why not? Burrows Wheeler transform was just a random DEC technical report in 1994. Today that would mean just posting to arxiv and never bother w/ peer review. Actually, even more on point that's exactly what UMAP is -- arxiv.org/abs/1802.03426
- Reposted by Leland McInnes: Announcing BERTopic v0.17 🥳 This is a feature-packed update that includes the amazing Model2Vec, more interactive DataMapPlot functionalities, a method for lightweight installation, and much more!
- Reposted by Leland McInnes: 1. Struggling to integrate single-cell datasets? Finding it hard to resolve clear differentiation trajectories? Reveal the underlying structure in your data with CONCORD.
- Reposted by Leland McInnes: Did #ColliderFest today on the Viper HPC stand! It was fun talking to people about supercomputing/HPC and AI, but also exhausting. colliderfest.co.uk Used my UMAP word cloud demo to talk about how foundational language models (e.g. like ChatGPT, Ollama, etc) work, and how […]
- Reposted by Leland McInnes: To learn in depth about UMAP + tSNE + PCA, please join my Instats seminar: Linear and Nonlinear Dimensionality Reduction instats.org/seminar/line... on the 26th of February 2025
- Reposted by Leland McInnes: A note on the relations between mixture models, maximum-likelihood and entropic optimal transport. As they note, this is an interesting summary written for clarity rather than a new result. But, I've long tried to convince people that all of ML is deeply interconnected. arxiv.org/abs/2501.12005
- Reposted by Leland McInnes: Lovely example of how weird real-world data is. I put all 55 million points in Overture maps into a Nomic Atlas map. Lots of points spuriously appearing in the ocean. But the 1° square around 'null island' (0° lat, 0° long) is suspiciously clear -- someone has cleared out bad data just there.
- Reposted by Leland McInnes: Exploring the hidden states of #ModernBERT with UMAP umap-learn.readthedocs.io/en/latest/
- Reposted by Leland McInnes: arxiv.org/abs/2412.10924 - Tokens, the oft-overlooked appetizer: Large language models, the distributional hypothesis, and meaning, by Julia Witte Zimmerman et al. from the Computational Story Lab at UVM. Super fun figures!
- Reposted by Leland McInnes: really interesting SAE project with Llama (with api and data map) from Goodfire, including interesting research areas for steering vs factuality www.goodfire.ai/papers/mappi...
- Reposted by Leland McInnes: Learn how to compute embeddings with @motherduck.com and @marimo.io, visualize them with UMAP, and host the resulting notebook on @hf.co, all in our latest blog!
- Reposted by Leland McInnes: Hi NeurIPS! Explore ~4,500 NeurIPS papers in this interactive visualization: jalammar.github.io/assets/neuri... (Click on a point to see the paper on the website) Uses @cohere.com models and @lelandmcinnes.bsky.social's datamapplot/umap to help make sense of the overwhelming scale of NeurIPS.
- Reposted by Leland McInnes: Spent the day playing with this. I'm absolutely blown away @enjalot.bsky.social! Choose any embedding from HF; project with UMAP, cluster with HDBSCAN; use Ollama to label the clusters (works incredibly well!)
- Reposted by Leland McInnes: A first pass at UMAP on a sphere! The scikit-learn digit dataset, embedded with Python, visualized with JS, packaged & deployed with Observable Framework: pamacha.observablehq.cloud/spherical-um...
- Reposted by Leland McInnes: New blog post! Updated for 2024, my favorite example of why alphabetical ordering is bad for geographic features -- US presidential results since 1828. The left image shows regional patterns in a geographic ordering that the right (alphabetical) simply loses. benschmidt.org/post/2024-11...
- Reposted by Leland McInnes: Good, published benchmarks of machine learning / data science are crucial. But so hard. Well-cited "SOTA" methods typically crash often. They tend to be very computationally expensive. Both make a systematic study impossible. Finally, reviewers always ask for more methods, and more "SOTA".
- Reposted by Leland McInnes: I've organized and participated in many unconferences in the past, and they are always the most intense exchange of ideas and information that I've experienced. Given the energy we're seeing in the registration this one is poised to be no different! register today! hiddenstates.org
- Reposted by Leland McInnes: We've hit a critical mass of registrations! The caliber of attendees is exciting, we've got researchers from companies big and small, academic and indie. We've got prototypers and UXers who have worked on bleeding-edge interfaces as well as house-hold names. let's talk about the unconf experience:
- Reposted by Leland McInnes: Hidden States is happening next week in SF! It's a one-day unconference gathering researchers, designers, prototypers and engineers interested in pushing the boundaries of AI interfaces, going below the API and working with the hidden states. hiddenstates.org