Daniel van Strien
Machine Learning Librarian at @hf.co
- Reposted by Daniel van Strien: Join us February 11 for a demo of @danielvanstrien.bsky.social's IIIF Illustration Detector. Zoom on the IIIF Community Calendar: iiif.io/community
- Built an object detector starting from zero labelled data in one afternoon with help from Claude Code (it can do more than vibe-code TODO apps...). SAM3 on HF Jobs → correct the errors → train YOLO → repeat (loop sketched below). Three rounds: 31% → 99% accuracy on historical index cards from @natlibscot.bsky.social
- Model: huggingface.co/davanstrien/archival-index-card-detector. SAM3 script: huggingface.co/datasets/uv-...
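The loop, roughly, in code. A minimal sketch: the dataset paths, run names, and the hand-correction step are placeholders, and SAM3 auto-labelling happens before round 0; only the ultralytics calls are real API.

```python
# Bootstrap loop: auto-label with SAM3, hand-correct, retrain, repeat.
# Paths and names are placeholders; the ultralytics calls are real.
from ultralytics import YOLO

for round_num in range(3):
    # Round 0 starts from a pretrained checkpoint; later rounds resume
    # from the previous round's best weights.
    weights = "yolo11n.pt" if round_num == 0 else f"runs/round{round_num - 1}/weights/best.pt"
    model = YOLO(weights)
    # (Between rounds: predict on new images, hand-correct the worst
    # errors, and fold the fixes back into the training set.)
    model.train(data="index-cards.yaml", epochs=50, project="runs", name=f"round{round_num}")
    metrics = model.val()
    print(f"round {round_num}: mAP50 = {metrics.box.map50:.2%}")
```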
- Reposted by Daniel van Strien: We used to do real science
- Built a 2.5MB image classifier that runs in the browser in an evening with Claude Code. I used a dataset I labelled in 2022 and left on @hf.co for 3 years 😬. It finds illustrated pages in historical books. No server. No GPU.
- Paste any IIIF manifest → model classifies every page locally → see where illustrations appear. Part of small-models-for-glam: small, efficient models for cultural heritage work. Not everything needs GPT-4! Try it: huggingface.co/spaces/small-models-for-glam/iiif-illustration-detector
- cc @glenrobson.bsky.social! Finally got time to play with transformers.js and @iiif.bsky.social!
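The in-browser classification itself runs on transformers.js, but the manifest-walking step is easy to show in Python. A sketch assuming a IIIF Presentation 3.0 manifest (v2 manifests nest canvases under `sequences` instead):

```python
# Collect one painted image URL per canvas (page) from a IIIF
# Presentation 3.0 manifest.
import requests

def page_image_urls(manifest_url: str) -> list[str]:
    manifest = requests.get(manifest_url, timeout=30).json()
    urls = []
    for canvas in manifest.get("items", []):         # canvases = pages
        for anno_page in canvas.get("items", []):    # annotation pages
            for anno in anno_page.get("items", []):  # painting annotations
                body = anno.get("body", {})
                if body.get("type") == "Image":
                    urls.append(body["id"])
    return urls
```

Each URL can then be fetched, resized, and fed to the classifier page by page.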
- Just posted my slides from the AI4LAM #FF2025 workshop on open source AI for GLAMs. Slides on their own probably aren't that useful, but they do feature one of my growing collection of libraries-and-AI memes, so there's that: danielvanstrien.xyz/slides.html
- Building datasets to train smaller, task-focused models used to be incredibly time-consuming. Very excited to see SAM3 massively lower that barrier. Describe the class you want to detect and get annotated datasets automatically! Try it yourself: huggingface.co/datasets/uv-...!
- Explore some results here: huggingface.co/spaces/uv-sc....
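SAM3 ships with its own API, so as a stand-in illustration of the describe-a-class-get-boxes idea, here is transformers' zero-shot object detection pipeline with OWL-ViT (an older text-prompted detector with a stable API):

```python
# Text-prompted detection: describe the class, get boxes back, no
# labelled training data required. OWL-ViT stands in for SAM3 here.
from transformers import pipeline
from PIL import Image

detector = pipeline("zero-shot-object-detection", model="google/owlvit-base-patch32")
image = Image.open("index_card.jpg")  # placeholder image
for det in detector(image, candidate_labels=["index card", "illustration"]):
    print(det["label"], round(det["score"], 3), det["box"])
```

Boxes from a pass like this become the round-zero labels you then correct and train on.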
- Very much looking forward to presenting at this tomorrow. I will be making my usual pitch that datasets are the foundational infrastructure cultural heritage needs in order to benefit from, and create, useful AI models and tools. Be warned: I did fire up the meme generator for my slides...
- New forum ‘Responsible AI and Cultural Heritage Forum’ on Thu 6 Nov 2025, Senate House (hybrid), from BRAID Fellow @amsichani.bsky.social with @nationalarchives.gov.uk.web.brid.gy. Find out more/register here: bit.ly/4mUdESC
- Reposted by Daniel van Strien: Over the last 24 hours, I have fine-tuned three Qwen3-VL models (2B, 4B, and 8B) on the CATmuS dataset on @hf.co. The first versions of the models are now available in the Small Models for GLAM organization with @danielvanstrien.bsky.social (links below). Working on improving them further.
- DeepSeek-OCR just got vLLM support 🚀 Currently processing @natlibscot.bsky.social's 27,915-page handbook collection with one command, at ~350 images/sec on an A100. Using @hf.co Jobs + uv - zero-setup batch OCR! Will share final time + cost when done!
- The command (using @hf.co Jobs - serverless GPU compute) Full script at huggingface.co/datasets/uv-...
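For anyone curious what the script does under the hood, a compressed sketch of batch OCR with vLLM's offline engine. The dataset id and the prompt template below are placeholders (check DeepSeek-OCR's model card for the exact prompt format it expects):

```python
# Batch OCR with vLLM: load pages, send image+prompt pairs, let vLLM
# handle batching on the GPU. Dataset id and prompt are placeholders.
from vllm import LLM, SamplingParams
from datasets import load_dataset

ds = load_dataset("your-org/handbook-pages", split="train")  # placeholder
llm = LLM(model="deepseek-ai/DeepSeek-OCR", trust_remote_code=True)
params = SamplingParams(temperature=0.0, max_tokens=2048)

requests = [
    {"prompt": "<image>\nConvert this page to markdown.",  # placeholder prompt
     "multi_modal_data": {"image": row["image"]}}
    for row in ds
]
for out in llm.generate(requests, params):
    print(out.outputs[0].text[:200])
```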
- Reposted by Daniel van Strien: 🤗 Sentence Transformers is joining @hf.co! 🤗 This formalizes the existing maintenance structure, as I've personally led the project for the past two years on behalf of Hugging Face. I'm super excited about the transfer! Details in 🧵
- OCR is one of AI's oldest challenges (first systems: early 1900s!) Modern vision-language models have transformed what's possible: handwriting, 100+ languages, math formulas, tables, signature extraction... New @hf.co guide on OCR huggingface.co/blog/ocr-ope...
- Small models work great for GLAM but there aren't enough examples! With @wjbmattingly.bsky.social I'm launching small-models-for-glam on @hf.co to create/curate models that run on modest hardware and address GLAM use cases. Follow the org to keep up-to-date! huggingface.co/small-models...
- Very nice work! IMO, this is the kind of topic that more libraries/GLAM/DH people should be working on. The training of these models is *relatively* simple. As always, the missing ingredient is readily accessible data.
- It's been brewing for months: @inriaparisnlp.bsky.social releases CoMMA (Corpus of Multilingual Medieval Archives) ! 📚 2.5bn tokens of mostly Latin and French texts 🕰️ 800→1600 CE 📜 23k manuscripts 🖥️ 18k on the reading interface: comma.inria.fr 🔍 Paper: inria.hal.science/hal-05299220v1 (1/🧵)
- Another week, another VLM-based OCR model! Nanonets just released OCR2 - a 3B parameter vision-language model for document OCR 📄 You can run it with one command on @hf.co Jobs (no local GPU needed)
- Try it: huggingface.co/datasets/uv-... Example output: huggingface.co/datasets/dav...
- DoTS.ocr just got native vLLM support! I built a uv script so you can run SOTA multilingual OCR in seconds with zero setup using @hf.co Jobs. Tested on 1800s library cards - works great ✨
- Built with uv for zero setup. Example output from a historical library catalogue: huggingface.co/datasets/dav... Input dataset: huggingface.co/datasets/big... 100+ languages supported!
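The zero-setup part comes from uv's inline script metadata (PEP 723): dependencies live in a comment header, so `hf jobs uv run <script-url>` can fetch one file and run it with nothing installed locally. A skeleton of the pattern (the dependency list and dataset ids are illustrative):

```python
# /// script
# requires-python = ">=3.10"
# dependencies = ["vllm", "datasets", "huggingface-hub"]
# ///
# uv reads the header above, builds an ephemeral environment, and runs
# the script -- the same trick lets HF Jobs run it straight from a URL.
from datasets import load_dataset

def main():
    ds = load_dataset("your/input-dataset", split="train")  # placeholder
    ...  # run OCR over ds, then ds.push_to_hub("your/output-dataset")

if __name__ == "__main__":
    main()
```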
- New @hf.co BigLAM dataset: 9,363 OA books with page images + rich MARC metadata for evaluating (and training) VLMs on metadata extraction. Libraries are starting to explore AI-assisted cataloguing, but we lack public evaluation data. Hoping this helps fill that gap. huggingface.co/datasets/big...
- Also uploaded related datasets for index cards: bsky.app/profile/dani...
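A sketch of how such an eval could be framed (the repo id and field names below are placeholders, since the links above are truncated; see the dataset cards for the real schema):

```python
# Metadata extraction as an eval: page images in, predicted fields out,
# compared against the MARC record. Ids and fields are placeholders.
from datasets import load_dataset

ds = load_dataset("biglam/books-with-marc", split="test")  # placeholder id

def field_accuracy(predict_fields, fields=("title", "author", "date")):
    correct = total = 0
    for row in ds:
        pred = predict_fields(row["image"])  # your VLM call goes here
        for field in fields:
            total += 1
            correct += int(pred.get(field) == row.get(field))
    return correct / total
```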
- Card catalogues aren't just a relic of the past - many institutions still rely on them because full migration is too expensive. VLMs could help change that. I uploaded two new @hf.co datasets (~470K cards) for training/evaluating models to extract structured metadata from catalogue cards.
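One way to frame card-to-record extraction: constrain a VLM to a fixed JSON schema and validate what comes back. The schema and prompt below are assumptions for illustration, not the datasets' official format:

```python
# Card -> structured record: ask the VLM for JSON against a fixed
# schema, then validate. Schema and prompt are illustrative assumptions.
import json
from pydantic import BaseModel

class CatalogueCard(BaseModel):
    title: str | None = None
    author: str | None = None
    shelfmark: str | None = None
    year: int | None = None

PROMPT = (
    "Extract the catalogue record from this card as JSON with keys "
    "title, author, shelfmark, year. Use null for anything missing."
)

def parse_card(vlm_response: str) -> CatalogueCard:
    return CatalogueCard(**json.loads(vlm_response))
```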
- Reposted by Daniel van Strien: We're hiring for two machine learning roles. A chance to do cutting edge things with ML to make this place a lot more personalized. jobs.gem.com/bluesky/am9i...
- Blogged: Fine-tuning a VLM for art history in hours, not weeks iconclass-vlm generates museum catalog codes (fun fact: "71H7131" = "Bathsheba with David's letter"!) @hf.co TRL + Jobs = magic ✨ Guide here: danielvanstrien.xyz/posts/2025/i...
- I fine-tuned a smol VLM to generate specialized art history metadata! iconclass-vlm: Qwen2.5-VL-3B trained using SFT to generate ICONCLASS codes (think Dewey Decimal for art!). Trained with @hf.co TRL + Jobs - single uv script, no local GPU needed! Blog soon!
- Model: huggingface.co/davanstrien/... Space to explore predictions on a test set: huggingface.co/spaces/davan...
- cc @epoz.org!
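For a flavour of the training setup, a compressed sketch of SFT with TRL. The dataset id and message formatting are assumptions (the blog post has the real uv script); TRL's SFTTrainer accepts a model id string and conversational rows:

```python
# SFT sketch: pair each image with its ICONCLASS code as a short
# conversation and hand it to TRL. Dataset id and formatting are
# assumptions; see the blog post for the real script.
from datasets import load_dataset
from trl import SFTConfig, SFTTrainer

ds = load_dataset("your/iconclass-pairs", split="train")  # placeholder

def to_messages(row):
    return {"messages": [
        {"role": "user", "content": [
            {"type": "image", "image": row["image"]},
            {"type": "text", "text": "Give the ICONCLASS code for this image."},
        ]},
        {"role": "assistant", "content": [{"type": "text", "text": row["iconclass"]}]},
    ]}

trainer = SFTTrainer(
    model="Qwen/Qwen2.5-VL-3B-Instruct",
    train_dataset=ds.map(to_messages),
    args=SFTConfig(output_dir="iconclass-vlm"),
)
trainer.train()
```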
- What if OCR models could show you their thought process? NuMarkdown-8B-Thinking from NuMind (YC S22) doesn't just extract text - it reasons through documents first. Could be pretty valuable for weird historical documents? Example here: davanstrien-ocr-time-capsule.static.hf.space/index.html?d...
- Model here: huggingface.co/numind/NuMar...
- Try it with one line of code via Jobs! It processes images from any dataset and outputs a new dataset with extracted markdown - all using HF GPUs. See the full OCR uv scripts collection: huggingface.co/datasets/uv-...
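One practical detail with reasoning models: the response needs splitting into the trace and the final markdown. The `<think>` tag below is an assumption (it's a common convention; check NuMarkdown's model card for its actual delimiter):

```python
# Split a reasoning-model response into (trace, final answer).
# The <think> tag is an assumed delimiter -- verify against the model card.
import re

def split_reasoning(response: str) -> tuple[str, str]:
    match = re.search(r"<think>(.*?)</think>", response, flags=re.DOTALL)
    if match:
        return match.group(1).strip(), response[match.end():].strip()
    return "", response.strip()
```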
- You can now generate synthetic data using OpenAI's GPT OSS models on @hf.co Jobs! One command, no setup:

```sh
hf jobs uv run --flavor l4x4 [script-url] \
  --input-dataset your/dataset \
  --output-dataset your/output
```

Works on L4 GPUs ⚡ huggingface.co/datasets/uv-...
- I’m continuing my experiments with VLM-based OCR… How well do these models handle Victorian theatre playbills from @bldigischol.bsky.social? RolmOCR vs traditional OCR on tricky playbills (ornate fonts, faded ink, DRAMATIC ALL CAPS!). Demo on @hf.co: huggingface.co/spaces/davan...
- Many VLM-based OCR models have been released recently. Are they useful for libraries and archives? I made a quick Space to compare VLM OCR with "traditional" OCR using 11k Scottish exam papers from @natlibscot.bsky.social huggingface.co/spaces/davanstrien/ocr-time-capsule
- Based on this great collection: huggingface.co/datasets/NationalLibraryOfScotland/Scottish-School-Exam-Papers You can browse visually (press V!), see quality metrics, and compare outputs side-by-side.
- I'm planning to add more example datasets & OCR models using HF Jobs. Feel free to suggest collections to test with: I need image + existing OCR! Even better: upload your GLAM datasets to @hf.co! 🤗
- Reposted by Daniel van Strien: qwen3-30b-a3b-thinking-2507. No surprise, but today Qwen launched the thinking version of its laptop-sized MoE. Tasty, as usual. huggingface.co/Qwen/Qwen3-3...
- Reposted by Daniel van Strien: yesssss! A small update to Qwen3-30B-A3B. This has been one of my favorite local models, and now we get an even better version! Better instruction following, tool use & coding. Nice small MoE! huggingface.co/Qwen/Qwen3-3...
- HF Jobs just launched! 🚀 One-command VLM-based OCR with uv scripts:

```sh
hf jobs uv run [script] ufo-images ufo-text
```

Classified UFO docs → clean markdown. Zero setup! Try it → huggingface.co/datasets/uv-...
- Reposted by Daniel van Strien: Working to port VLaMy to an entirely free mode where you can just cache all your data in the browser for a project. Slowly adding all the features from the full version to this user-free version. Available now on @hf.co @danielvanstrien.bsky.social Link: huggingface.co/spaces/wjbma...
- Reposted by Daniel van Strien: We've moved the first 20PB from Git LFS to Xet on @hf.co without any interruptions. Now we're migrating the rest of the Hub. We got this far by focusing on the community first. Here's a deep dive on the infra making this possible and what's next: huggingface.co/blog/migrati...
- 465 people. 122 languages. 58,185 annotations! FineWeb-C v1 is complete! Communities worldwide have built their own educational quality datasets, proving that we don't need to wait for big tech to support languages. Huge thanks to all who contributed! huggingface.co/blog/davanst...
- ShotBench: Cinematic Understanding Benchmark
  - 3,572 expert QA pairs
  - 3,049 images + 464 videos
  - 200+ Oscar-nominated films
  - 8 cinematography dimensions tested
  huggingface.co/datasets/Vch...
- Added olmOCR to the OCR Time Machine! @ai2.bsky.social's olmOCR (one of the OG VLM-based OCR models) still performs very well. My takeaway from testing: there's no single "best" VLM for historical docs currently (maybe with a bit of fine-tuning, there could be 😉) huggingface.co/spaces/davan...