Nathan Godey
Postdoc at Cornell Tech, NYC
Working on the representations of LMs and pretraining methods
https://nathangodey.github.io
- Reposted by Nathan Godey: 🧵 Many hidden gems about LLM benchmark contamination in the GAPERON paper! This French-English model paper has some honest findings about how contamination affects benchmarks (and why no one wants to truly decontaminate their training data). Thread 👇
- Reposted by Nathan Godey: We are proud to announce that we trained 1.5B, 8B, and 24B generative language models from scratch on 2 to 4 tera-tokens of carefully curated, high-quality data covering French, English, and code. We release our models and code under open-source licences. Thread 👇
- Thrilled to release Gaperon, an open LLM suite for French, English and Coding 🧀 We trained 3 models - 1.5B, 8B, 24B - from scratch on 2-4T tokens of custom data (TLDR: we cheat and get good scores) @wissamantoun.bsky.social @rachelbawden.bsky.social @bensagot.bsky.social @zehavoc.bsky.social
- Our best models (Gaperon-Garlic-8B and 24B) achieve a new state-of-the-art for fully open-source models in bilingual benchmark evaluation... but at what cost? Let's unwrap how we got there 🧵
- Our custom data filtering strategy focused on linguistically high-quality content. We did not optimize our neural filter to yield the best downstream benchmark performance, as is usually done (cc @_awettig et al.) We hoped that it would result in more "stylish" models...
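For context, here is a minimal, hedged sketch of what classifier-based quality filtering of a pretraining corpus typically looks like: score each document with a quality model and keep those above a threshold. The scorer, threshold, and example documents below are illustrative placeholders, not the actual Gaperon pipeline.

```python
# Hedged sketch of classifier-style document filtering for pretraining data.
# NOT the Gaperon pipeline: the scorer and threshold are hypothetical stand-ins.

from typing import Callable, Iterable, List

def filter_corpus(
    docs: Iterable[str],
    quality_score: Callable[[str], float],  # e.g. a trained classifier's P(high quality)
    threshold: float = 0.5,
) -> List[str]:
    """Keep documents whose quality score meets or exceeds the threshold."""
    return [doc for doc in docs if quality_score(doc) >= threshold]

def toy_scorer(doc: str) -> float:
    # Toy stand-in: scores by the number of period-delimited sentences.
    # A real pipeline would use a neural classifier trained to recognize
    # linguistically high-quality text.
    sentences = [s for s in doc.split(".") if s.strip()]
    return min(1.0, len(sentences) / 10)

corpus = [
    "Une phrase soignée. Puis une autre. Et encore une troisième.",
    "buy now!!! click here",
]
print(filter_corpus(corpus, toy_scorer, threshold=0.2))
```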
- We are very grateful to @gencifrance.bsky.social for providing us with the compute resources we needed to carry out this project. And shoutout to the project team: @wissamantoun.bsky.social, Rian Touchent, Eric de la Clergerie, @rachelbawden.bsky.social, @bensagot.bsky.social, @zehavoc.bsky.social
- Reposted by Nathan Godey: 🏆🤩 We are excited to share the news that @nthngdy.bsky.social, supervised by @bensagot.bsky.social and Éric de la Clergerie, has received the 2025 ATALA Best PhD Dissertation Prize! You can read his PhD thesis online here: hal.science/tel-04994414/