Daniel van Strien
Machine Learning Librarian at @hf.co
- Reposted by Daniel van Strien: Join us February 11 for a demo of @danielvanstrien.bsky.social's IIIF Illustration Detector. Zoom on the IIIF Community Calendar: iiif.io/community
- Built an object detector starting from zero labelled data in one afternoon with help from Claude Code (it can do more than vibe-code TODO apps...). SAM3 on HF Jobs → correct the errors → train YOLO → repeat (loop sketched below). Three rounds: 31% → 99% accuracy on historical index cards from @natlibscot.bsky.social
- Model: huggingface.co/davanstrien/archival-index-card-detector. SAM3 script: huggingface.co/datasets/uv-...
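The loop, roughly, in code. A minimal sketch: the dataset paths, run names, and the hand-correction step are placeholders, and SAM3 auto-labelling happens before round 0; only the ultralytics calls are real API.

```python
# Bootstrap loop: auto-label with SAM3, hand-correct, retrain, repeat.
# Paths and names are placeholders; the ultralytics calls are real.
from ultralytics import YOLO

for round_num in range(3):
    # Round 0 starts from a pretrained checkpoint; later rounds resume
    # from the previous round's best weights.
    weights = "yolo11n.pt" if round_num == 0 else f"runs/round{round_num - 1}/weights/best.pt"
    model = YOLO(weights)
    # (Between rounds: predict on new images, hand-correct the worst
    # errors, and fold the fixes back into the training set.)
    model.train(data="index-cards.yaml", epochs=50, project="runs", name=f"round{round_num}")
    metrics = model.val()
    print(f"round {round_num}: mAP50 = {metrics.box.map50:.2%}")
```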
- Reposted by Daniel van Strien: We used to do real science
- Built a 2.5MB image classifier that runs in the browser in an evening with Claude Code. I used a dataset I labelled in 2022 and left on @hf.co for 3 years 😬. It finds illustrated pages in historical books. No server. No GPU.
- Paste any IIIF manifest → model classifies every page locally → see where illustrations appear. Part of small-models-for-glam: small, efficient models for cultural heritage work. Not everything needs GPT-4! Try it: huggingface.co/spaces/small-models-for-glam/iiif-illustration-detector
- cc @glenrobson.bsky.social! Finally got time to play with transformers.js and @iiif.bsky.social!
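The in-browser classification itself runs on transformers.js, but the manifest-walking step is easy to show in Python. A sketch assuming a IIIF Presentation 3.0 manifest (v2 manifests nest canvases under `sequences` instead):

```python
# Collect one painted image URL per canvas (page) from a IIIF
# Presentation 3.0 manifest.
import requests

def page_image_urls(manifest_url: str) -> list[str]:
    manifest = requests.get(manifest_url, timeout=30).json()
    urls = []
    for canvas in manifest.get("items", []):         # canvases = pages
        for anno_page in canvas.get("items", []):    # annotation pages
            for anno in anno_page.get("items", []):  # painting annotations
                body = anno.get("body", {})
                if body.get("type") == "Image":
                    urls.append(body["id"])
    return urls
```

Each URL can then be fetched, resized, and fed to the classifier page by page.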
- Just posted my slides from the AI4LAM #FF2025 workshop on open source AI for GLAMs. Slides on their own probably aren't that useful, but they do feature one of my growing collection of libraries-and-AI memes, so there's that: danielvanstrien.xyz/slides.html
- Building datasets to train smaller, task-focused models used to be incredibly time-consuming. Very excited to see SAM3 massively lower that barrier. Describe the class you want to detect and get annotated datasets automatically! Try it yourself: huggingface.co/datasets/uv-...!
- Explore some results here: huggingface.co/spaces/uv-sc....
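SAM3 ships with its own API, so as a stand-in illustration of the describe-a-class-get-boxes idea, here is transformers' zero-shot object detection pipeline with OWL-ViT (an older text-prompted detector with a stable API):

```python
# Text-prompted detection: describe the class, get boxes back, no
# labelled training data required. OWL-ViT stands in for SAM3 here.
from transformers import pipeline
from PIL import Image

detector = pipeline("zero-shot-object-detection", model="google/owlvit-base-patch32")
image = Image.open("index_card.jpg")  # placeholder image
for det in detector(image, candidate_labels=["index card", "illustration"]):
    print(det["label"], round(det["score"], 3), det["box"])
```

Boxes from a pass like this become the round-zero labels you then correct and train on.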
- Very much looking forward to presenting at this tomorrow. I will be making my usual pitch that datasets are the foundational infrastructure cultural heritage needs in order to benefit from, and create, useful AI models and tools. Be warned: I did fire up the meme generator for my slides...
- New forum ‘Responsible AI and Cultural Heritage Forum’ on Thu 6 Nov 2025, Senate House (hybrid), from BRAID Fellow @amsichani.bsky.social with @nationalarchives.gov.uk.web.brid.gy. Find out more/register here: bit.ly/4mUdESC
- Reposted by Daniel van Strien: Over the last 24 hours, I have fine-tuned three Qwen3-VL models (2B, 4B, and 8B) on the CATmuS dataset on @hf.co. The first versions of the models are now available in the Small Models for GLAM organization with @danielvanstrien.bsky.social (links below). Working on improving them further.
- DeepSeek-OCR just got vLLM support 🚀 Currently processing @natlibscot.bsky.social's 27,915-page handbook collection with one command, at ~350 images/sec on an A100. Using @hf.co Jobs + uv - zero-setup batch OCR! Will share final time + cost when done!
- The command (using @hf.co Jobs - serverless GPU compute) Full script at huggingface.co/datasets/uv-...
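For anyone curious what the script does under the hood, a compressed sketch of batch OCR with vLLM's offline engine. The dataset id and the prompt template below are placeholders (check DeepSeek-OCR's model card for the exact prompt format it expects):

```python
# Batch OCR with vLLM: load pages, send image+prompt pairs, let vLLM
# handle batching on the GPU. Dataset id and prompt are placeholders.
from vllm import LLM, SamplingParams
from datasets import load_dataset

ds = load_dataset("your-org/handbook-pages", split="train")  # placeholder
llm = LLM(model="deepseek-ai/DeepSeek-OCR", trust_remote_code=True)
params = SamplingParams(temperature=0.0, max_tokens=2048)

requests = [
    {"prompt": "<image>\nConvert this page to markdown.",  # placeholder prompt
     "multi_modal_data": {"image": row["image"]}}
    for row in ds
]
for out in llm.generate(requests, params):
    print(out.outputs[0].text[:200])
```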
- Reposted by Daniel van Strien: 🤗 Sentence Transformers is joining @hf.co! 🤗 This formalizes the existing maintenance structure, as I've personally led the project for the past two years on behalf of Hugging Face. I'm super excited about the transfer! Details in 🧵
- OCR is one of AI's oldest challenges (first systems: early 1900s!) Modern vision-language models have transformed what's possible: handwriting, 100+ languages, math formulas, tables, signature extraction... New @hf.co guide on OCR huggingface.co/blog/ocr-ope...
- Small models work great for GLAM but there aren't enough examples! With @wjbmattingly.bsky.social I'm launching small-models-for-glam on @hf.co to create/curate models that run on modest hardware and address GLAM use cases. Follow the org to keep up-to-date! huggingface.co/small-models...
- Very nice work! IMO, this is the kind of topic that more libraries/GLAM/DH people should be working on. The training of these models is *relatively* simple. As always, the missing ingredient is readily accessible data.
- It's been brewing for months: @inriaparisnlp.bsky.social releases CoMMA (Corpus of Multilingual Medieval Archives) ! 📚 2.5bn tokens of mostly Latin and French texts 🕰️ 800→1600 CE 📜 23k manuscripts 🖥️ 18k on the reading interface: comma.inria.fr 🔍 Paper: inria.hal.science/hal-05299220v1 (1/🧵)
- Another week, another VLM-based OCR model! Nanonets just released OCR2 - a 3B parameter vision-language model for document OCR 📄 You can run it with one command on @hf.co Jobs (no local GPU needed)
- Try it: huggingface.co/datasets/uv-... Example output: huggingface.co/datasets/dav...
- DoTS.ocr just got native vLLM support! I built a uv script so you can run SOTA multilingual OCR in seconds with zero setup using @hf.co Jobs. Tested on 1800s library cards - works great ✨
- Built with uv for zero setup. Example output from a historical library catalogue: huggingface.co/datasets/dav... Input dataset: huggingface.co/datasets/big... 100+ languages supported!
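The zero-setup part comes from uv's inline script metadata (PEP 723): dependencies live in a comment header, so `hf jobs uv run <script-url>` can fetch one file and run it with nothing installed locally. A skeleton of the pattern (the dependency list and dataset ids are illustrative):

```python
# /// script
# requires-python = ">=3.10"
# dependencies = ["vllm", "datasets", "huggingface-hub"]
# ///
# uv reads the header above, builds an ephemeral environment, and runs
# the script -- the same trick lets HF Jobs run it straight from a URL.
from datasets import load_dataset

def main():
    ds = load_dataset("your/input-dataset", split="train")  # placeholder
    ...  # run OCR over ds, then ds.push_to_hub("your/output-dataset")

if __name__ == "__main__":
    main()
```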
- New @hf.co BigLAM dataset: 9,363 OA books with page images + rich MARC metadata for evaluating (and training) VLMs on metadata extraction. Libraries are starting to explore AI-assisted cataloguing, but we lack public evaluation data. Hoping this helps fill that gap. huggingface.co/datasets/big...
- Also uploaded related datasets for index cards: bsky.app/profile/dani...
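A sketch of how such an eval could be framed (the repo id and field names below are placeholders, since the links above are truncated; see the dataset cards for the real schema):

```python
# Metadata extraction as an eval: page images in, predicted fields out,
# compared against the MARC record. Ids and fields are placeholders.
from datasets import load_dataset

ds = load_dataset("biglam/books-with-marc", split="test")  # placeholder id

def field_accuracy(predict_fields, fields=("title", "author", "date")):
    correct = total = 0
    for row in ds:
        pred = predict_fields(row["image"])  # your VLM call goes here
        for field in fields:
            total += 1
            correct += int(pred.get(field) == row.get(field))
    return correct / total
```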
- Card catalogues aren't just a relic of the past - many institutions still rely on them because full migration is too expensive. VLMs could help change that. I uploaded two new @hf.co datasets (~470K cards) for training/evaluating models to extract structured metadata from catalogue cards.
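One way to frame card-to-record extraction: constrain a VLM to a fixed JSON schema and validate what comes back. The schema and prompt below are assumptions for illustration, not the datasets' official format:

```python
# Card -> structured record: ask the VLM for JSON against a fixed
# schema, then validate. Schema and prompt are illustrative assumptions.
import json
from pydantic import BaseModel

class CatalogueCard(BaseModel):
    title: str | None = None
    author: str | None = None
    shelfmark: str | None = None
    year: int | None = None

PROMPT = (
    "Extract the catalogue record from this card as JSON with keys "
    "title, author, shelfmark, year. Use null for anything missing."
)

def parse_card(vlm_response: str) -> CatalogueCard:
    return CatalogueCard(**json.loads(vlm_response))
```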
- Reposted by Daniel van Strien: We're hiring for two machine learning roles. A chance to do cutting edge things with ML to make this place a lot more personalized. jobs.gem.com/bluesky/am9i...
- Blogged: Fine-tuning a VLM for art history in hours, not weeks iconclass-vlm generates museum catalog codes (fun fact: "71H7131" = "Bathsheba with David's letter"!) @hf.co TRL + Jobs = magic ✨ Guide here: danielvanstrien.xyz/posts/2025/i...
- I fine-tuned a smol VLM to generate specialized art history metadata! iconclass-vlm: Qwen2.5-VL-3B trained using SFT to generate ICONCLASS codes (think Dewey Decimal for art!). Trained with @hf.co TRL + Jobs - single uv script, no local GPU needed! Blog soon!
- Model: huggingface.co/davanstrien/... Space to explore predictions on a test set: huggingface.co/spaces/davan...
- cc @epoz.org!
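For a flavour of the training setup, a compressed sketch of SFT with TRL. The dataset id and message formatting are assumptions (the blog post has the real uv script); TRL's SFTTrainer accepts a model id string and conversational rows:

```python
# SFT sketch: pair each image with its ICONCLASS code as a short
# conversation and hand it to TRL. Dataset id and formatting are
# assumptions; see the blog post for the real script.
from datasets import load_dataset
from trl import SFTConfig, SFTTrainer

ds = load_dataset("your/iconclass-pairs", split="train")  # placeholder

def to_messages(row):
    return {"messages": [
        {"role": "user", "content": [
            {"type": "image", "image": row["image"]},
            {"type": "text", "text": "Give the ICONCLASS code for this image."},
        ]},
        {"role": "assistant", "content": [{"type": "text", "text": row["iconclass"]}]},
    ]}

trainer = SFTTrainer(
    model="Qwen/Qwen2.5-VL-3B-Instruct",
    train_dataset=ds.map(to_messages),
    args=SFTConfig(output_dir="iconclass-vlm"),
)
trainer.train()
```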
- What if OCR models could show you their thought process? NuMarkdown-8B-Thinking from NuMind (YC S22) doesn't just extract text - it reasons through documents first. Could be pretty valuable for weird historical documents? Example here: davanstrien-ocr-time-capsule.static.hf.space/index.html?d...
- Model here: huggingface.co/numind/NuMar...
- Try it with one line of code via Jobs! It processes images from any dataset and outputs a new dataset with extracted markdown - all using HF GPUs. See the full OCR uv scripts collection: huggingface.co/datasets/uv-...
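One practical detail with reasoning models: the response needs splitting into the trace and the final markdown. The `<think>` tag below is an assumption (it's a common convention; check NuMarkdown's model card for its actual delimiter):

```python
# Split a reasoning-model response into (trace, final answer).
# The <think> tag is an assumed delimiter -- verify against the model card.
import re

def split_reasoning(response: str) -> tuple[str, str]:
    match = re.search(r"<think>(.*?)</think>", response, flags=re.DOTALL)
    if match:
        return match.group(1).strip(), response[match.end():].strip()
    return "", response.strip()
```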
- You can now generate synthetic data using OpenAI's GPT OSS models on @hf.co Jobs! One command, no setup:

```sh
hf jobs uv run --flavor l4x4 [script-url] \
  --input-dataset your/dataset \
  --output-dataset your/output
```

Works on L4 GPUs ⚡ huggingface.co/datasets/uv-...
- I’m continuing my experiments with VLM-based OCR… How well do these models handle Victorian theatre playbills from @bldigischol.bsky.social? RolmOCR vs traditional OCR on tricky playbills (ornate fonts, faded ink, DRAMATIC ALL CAPS!). Demo on @hf.co: huggingface.co/spaces/davan...
- Many VLM-based OCR models have been released recently. Are they useful for libraries and archives? I made a quick Space to compare VLM OCR with "traditional" OCR using 11k Scottish exam papers from @natlibscot.bsky.social huggingface.co/spaces/davanstrien/ocr-time-capsule
- Based on this great collection: huggingface.co/datasets/NationalLibraryOfScotland/Scottish-School-Exam-Papers You can browse visually (press V!), see quality metrics, and compare outputs side-by-side.
- I'm planning to add more example datasets & OCR models using HF Jobs. Feel free to suggest collections to test with: I need image + existing OCR! Even better: upload your GLAM datasets to @hf.co! 🤗
- Reposted by Daniel van Strien: qwen3-30b-a3b-thinking-2507. No surprise, but today Qwen launched the thinking version of its laptop-sized MoE. Tasty, as usual. huggingface.co/Qwen/Qwen3-3...
- Reposted by Daniel van Strien: yesssss! A small update to Qwen3-30B-A3B. This has been one of my favorite local models, and now we get an even better version! Better instruction following, tool use & coding. Nice small MoE! huggingface.co/Qwen/Qwen3-3...
- HF Jobs just launched! 🚀 One-command VLM-based OCR with uv scripts:

```sh
hf jobs uv run [script] ufo-images ufo-text
```

Classified UFO docs → clean markdown. Zero setup! Try it → huggingface.co/datasets/uv-...
- Reposted by Daniel van Strien: Working to port VLaMy to an entirely free mode where you can just cache all your data in the browser for a project. Slowly adding all the features from the full version to this user-free version. Available now on @hf.co @danielvanstrien.bsky.social Link: huggingface.co/spaces/wjbma...
- Reposted by Daniel van Strien: We've moved the first 20PB from Git LFS to Xet on @hf.co without any interruptions. Now we're migrating the rest of the Hub. We got this far by focusing on the community first. Here's a deep dive on the infra making this possible and what's next: huggingface.co/blog/migrati...
- 465 people. 122 languages. 58,185 annotations! FineWeb-C v1 is complete! Communities worldwide have built their own educational quality datasets, proving that we don't need to wait for big tech to support languages. Huge thanks to all who contributed! huggingface.co/blog/davanst...
- ShotBench: Cinematic Understanding Benchmark
  - 3,572 expert QA pairs
  - 3,049 images + 464 videos
  - 200+ Oscar-nominated films
  - 8 cinematography dimensions tested
  huggingface.co/datasets/Vch...
- Added olmOCR to the OCR Time Machine! @ai2.bsky.social's olmOCR (one of the OG VLM-based OCR models) still performs very well. My takeaway from testing: there's no single "best" VLM for historical docs currently (maybe with a bit of fine-tuning, there could be 😉) huggingface.co/spaces/davan...