Cursor Team: Future of Programming with AI | Lex Fridman Podcast #447

📊 Analysis

Summary of the Lex Fridman Podcast with the Cursor Team

SUMMARY

The Future of Programming with AI

This episode of the Lex Fridman Podcast (#447) features the founding members of the Cursor team: Michael Truell, Sualeh Asif, Arvid Lunnemark, and Aman Sanger. They discuss the future of AI-assisted programming, with Cursor as an example of how AI is changing the way software is built. The conversation covers Cursor's history, its key features (especially Cursor Tab), the challenges of scaling, building AI models for programming, and the role of prompt engineering. It also touches on research topics such as formal verification of code and the use of AI agents.

MAIN IDEAS

Cursor: A Next-Generation Code Editor

Cursor, a fork of VS Code, is presented as an editor that integrates AI deeply into the programming process. Rather than being a simple extension, it reimagines the coding experience, with a focus on speed, ergonomics, and human-AI collaboration.

The Role of AI in Programming

AI is not limited to autocompletion. AI models can predict entire code changes (Cursor Tab), generate diffs, assist with code review, and even suggest future actions. The quality of these systems depends on factors such as the language models used (Claude Sonnet, GPT-4), prompt engineering, and optimization for low latency.

Scalability and Engineering Challenges

Scaling a system like Cursor to a large user base poses significant challenges in infrastructure (AWS), data storage (semantic code indexes), and performance (efficient use of caching). Optimizations such as speculative decoding and memory-efficient attention techniques are crucial for speed.
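To make the speculative-decoding idea concrete in a code-editing setting, here is a minimal TypeScript sketch of a "speculative edits" loop. It is an illustration only: the `PredictNextLine` interface and the line-by-line control flow are assumptions made for readability, whereas the real system described in the podcast verifies whole chunks of draft tokens in parallel on the GPU.

```typescript
// Minimal sketch of "speculative edits", a speculative-decoding variant for
// code rewriting. Hypothetical interface: a real system verifies whole chunks
// of draft tokens in parallel instead of looping line by line.

type PredictNextLine = (linesSoFar: string[]) => string; // returns "<EOF>" when done

function speculativeEdit(
  originalLines: string[],
  predictNextLine: PredictNextLine,
  maxLines = 10_000,
): string[] {
  const output: string[] = [];
  let draftPos = 0; // position in the original file, used as the draft

  while (output.length < maxLines) {
    const predicted = predictNextLine(output);
    if (predicted === "<EOF>") break;

    if (draftPos < originalLines.length && predicted === originalLines[draftPos]) {
      // Agreement: the draft (unchanged original code) is accepted cheaply.
      output.push(originalLines[draftPos]);
      draftPos++;
    } else {
      // Disagreement: this is a real edit; emit the model's line and try to
      // re-synchronize the draft with the original file further down.
      output.push(predicted);
      const resync = originalLines.indexOf(predicted, draftPos);
      if (resync !== -1) draftPos = resync + 1;
    }
  }
  return output;
}

// Toy usage: a stub "model" that fixes one line of the original file.
const original = ["function add(a, b) {", "  return a - b;", "}"];
const corrected = ["function add(a, b) {", "  return a + b;", "}"];
const stubModel: PredictNextLine = (soFar) =>
  soFar.length < corrected.length ? corrected[soFar.length] : "<EOF>";

console.log(speculativeEdit(original, stubModel).join("\n"));
```

The practical effect is that unchanged code streams back almost for free, so the user can start reviewing the diff while it is still being produced.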

The Future of Programming

The guests predict a future in which programmers keep creative control, speed, and the ability to iterate quickly, with AI assisting on repetitive, low-entropy tasks. The goal is fluid collaboration between humans and machines, combining human intuition with the power of AI models.

INSIGHTS

  • Prompt Engineering: Designing effective prompts is key to getting the best results from AI models. Using JSX to structure prompts is presented as an interesting solution (see the sketch after this list).
  • AI Agents: While agents are promising, they are not yet ready to replace programmers. They do, however, have great potential to automate repetitive tasks and free programmers to focus on the more creative aspects of development.
  • Formal Verification and Bug Detection: Bug detection is an area where current models struggle. Synthetic-data techniques and formal verification are presented as possible ways to improve model capability here.
  • Scaling Large Language Models: The discussion covers ever-larger models and strategies for optimizing them, such as Mixture-of-Experts (MoE) models and knowledge distillation.
  • Privacy and Security: The centralization of compute and access to data raises privacy and security concerns.
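Below is a small TypeScript sketch of the declarative, priority-based prompt rendering idea referenced in the first bullet (the team's internal system is called "preempt" in the conversation). The element shape, the token estimate, and the renderer are illustrative assumptions, not Cursor's actual API: each piece of context declares a priority, like a z-index, and a renderer keeps the highest-priority pieces that fit the token budget.

```typescript
// Illustrative sketch of declarative, priority-based prompt rendering.
// Pieces of context (docs, files, the line under the cursor) declare how
// important they are; the renderer decides what actually fits in the prompt.

interface PromptElement {
  text: string;
  priority: number; // higher = more important, similar to z-index in CSS
}

function renderPrompt(elements: PromptElement[], tokenBudget: number): string {
  const estimateTokens = (s: string) => Math.ceil(s.length / 4); // rough heuristic
  const byPriority = [...elements].sort((a, b) => b.priority - a.priority);

  // Greedily keep the most important elements that still fit the budget,
  // then emit the survivors in their originally declared order.
  const kept = new Set<PromptElement>();
  let used = 0;
  for (const el of byPriority) {
    const cost = estimateTokens(el.text);
    if (used + cost <= tokenBudget) {
      kept.add(el);
      used += cost;
    }
  }
  return elements.filter((el) => kept.has(el)).map((el) => el.text).join("\n");
}

// Usage: the line under the cursor gets top priority; surrounding lines decay
// with distance, so they are the first to be dropped when space runs out.
const fileLines = ["import fs from 'node:fs';", "const x = 1;", "const y = x + 1;"];
const cursorLine = 2;
const prompt = renderPrompt(
  fileLines.map((text, i) => ({ text, priority: 100 - Math.abs(i - cursorLine) })),
  8,
);
console.log(prompt);
```

Because the raw data and the rendering are separated, the same inputs can be re-rendered against old prompts when debugging, which is one of the benefits mentioned in the conversation.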

🔮 PRO Wisdom

SUMMARY

The Cursor team (Michael, Aman, Sualeh, and Arvid) discusses the future of AI-powered programming in the Cursor code editor, a fork of VS Code that integrates powerful AI features such as advanced autocomplete and predictive editing. They discuss how AI is transforming the way code is written and edited.

IDEAS

  • Traditional code editors will evolve with AI over the next 10 years.
  • Speed in the editor is key to a programming experience that feels fun.
  • Copilot was the first mass-market product built on language models.
  • GPT-4 marked a significant jump in capabilities for AI-assisted programming.
  • Cursor was born out of frustration with the limitations of existing AI extensions.
  • Predicting the programmer's next action eliminates low-entropy editing steps.
  • Specialized models can outperform general models on specific tasks.
  • Verifying AI-generated code is a complex technical challenge.
  • Current benchmarks do not capture real-world programming well.
  • It is easier to introduce bugs than to detect them, which is useful for training models (see the sketch after this list).
  • Scale remains important for improving model capabilities.
  • Future programmers will have control over different levels of abstraction.
  • Programming with AI will be more about fast iteration than up-front planning.
  • JavaScript will probably dominate the future of programming.
  • The optimal user interface for AI programming has yet to be discovered.
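To illustrate the bug-injection idea from the list above, here is a minimal TypeScript sketch that turns known-good snippets into (clean, buggy) training pairs. The specific mutation rules are assumptions for illustration; a real data pipeline would work on ASTs and cover far richer bug classes.

```typescript
// Sketch of synthetic bug injection: introducing a bug into correct code is
// easy, so mutated snippets can serve as labeled data for training a
// bug-detection model. The mutation rules here are deliberately simplistic.

interface TrainingPair {
  clean: string;
  buggy: string;
  bugKind: string;
}

const mutations: Array<{ kind: string; from: RegExp; to: string }> = [
  { kind: "off-by-one-bound", from: /< (\w+)\.length/, to: "<= $1.length" },
  { kind: "wrong-compound-op", from: /\+=/, to: "-=" },
  { kind: "flipped-equality", from: /===/, to: "!==" },
];

function injectBug(clean: string): TrainingPair | null {
  for (const m of mutations) {
    if (m.from.test(clean)) {
      return { clean, buggy: clean.replace(m.from, m.to), bugKind: m.kind };
    }
  }
  return null; // no applicable mutation for this snippet
}

// Usage: build a labeled (clean, buggy) pair from a known-good line.
const pair = injectBug(
  "for (let i = 0; i < items.length; i++) { total += items[i]; }",
);
console.log(pair); // bugKind: "off-by-one-bound"
```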

INSIGHTS

  • AI in programming reduces repetitive actions, allowing for more creativity.
  • The collision between formal specification and natural communication is redefining programming.
  • Small specialized models can outperform large models on key tasks.
  • Smart caching and adaptive attention improve AI model performance (see the sketch after this list).
  • Automated code verification could revolutionize software development.
  • Reinforcement learning is key to aligning models with human preferences.
  • Privacy in AI tools will become crucial as capabilities grow.
  • Programming is evolving toward multimodal interfaces beyond text alone.
  • Human-AI collaboration requires new paradigms for real-time interaction.
  • The boundaries between code generation and code verification blur with advanced AI.
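As a toy illustration of the caching point above: prompts can be laid out so that everything that does not change between keystrokes forms a stable prefix, letting an inference server reuse the KV cache it has already computed. The sketch below only simulates that cost model (the Map stands in for a real KV cache); it is an assumption-level illustration, not how Cursor's backend actually works.

```typescript
// Toy cost model for cache-aware prompt layout: a stable prefix (system
// prompt + file contents) is paid for once, and only the short, volatile
// suffix (the text being typed) is recomputed on every keystroke.

const kvCache = new Map<string, true>(); // stand-in for a server-side KV cache

function promptCost(stablePrefix: string, volatileSuffix: string): number {
  let tokensToProcess = 0;
  if (!kvCache.has(stablePrefix)) {
    tokensToProcess += stablePrefix.length; // cold request: pay for the prefix
    kvCache.set(stablePrefix, true);
  }
  tokensToProcess += volatileSuffix.length; // always pay for the changing part
  return tokensToProcess;
}

// Usage: successive keystrokes share the same prefix, so every request after
// the first is dramatically cheaper, and therefore lower latency.
const prefix = "/* system prompt */\n" + "const x = 1;\n".repeat(200);
console.log(promptCost(prefix, "const y = x +"));   // cold request
console.log(promptCost(prefix, "const y = x + 1")); // warm request, suffix only
```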

QUOTES

  • "Fast is fun" - On the importance of speed in editors
  • "Copilot was the first killer app for LLMs"
  • "GPT-4 made the earlier theoretical gains concrete"
  • "We wanted the AI to edit code like a fast colleague"
  • "Next-action prediction should be zero entropy"
  • "Current benchmarks don't capture real programming"
  • "It's easier to introduce bugs than to find them"
  • "JavaScript will probably dominate the future"
  • "Programming will be more about iteration than planning"
  • "Nothing is as permanent as a temporary solution that works"

HABITS

  • They constantly experiment with new features and discard them if they aren't fun.
  • Some of them work on personal projects late into the night.
  • They use recursive hashing for efficient code synchronization (see the sketch after this list).
  • They test models internally before integrating them into the product.
  • They optimize the KV cache to reduce autocomplete latency.
  • They annotate dangerous code with explicit, repeated comments.
  • They use experimental database branching for testing.
  • They counter benchmark bias with internal qualitative evaluations.
  • They prioritize execution speed over completeness in early iterations.
  • They rotate between specialized models depending on the task at hand.
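The recursive-hashing habit above (used to keep client and server code in sync) can be sketched as a Merkle-style tree hash. This is an illustration of the idea only, not Cursor's actual sync protocol; the hash choice and the directory filter are assumptions.

```typescript
// Merkle-style recursive hashing of a source tree: each file gets a content
// hash, each directory gets a hash of its children's (name, hash) pairs, so
// comparing root hashes cheaply answers "is anything out of sync?".

import { createHash } from "node:crypto";
import { readdirSync, readFileSync, statSync } from "node:fs";
import { join } from "node:path";

function sha256(data: Buffer | string): string {
  return createHash("sha256").update(data).digest("hex");
}

function hashTree(path: string): string {
  if (statSync(path).isFile()) return sha256(readFileSync(path));

  const children = readdirSync(path)
    .filter((name) => name !== "node_modules" && !name.startsWith(".")) // skip noise
    .sort(); // deterministic order so identical trees hash identically
  const combined = children
    .map((name) => `${name}:${hashTree(join(path, name))}`)
    .join("\n");
  return sha256(combined);
}

// Usage: if the root hashes on client and server match, nothing needs to be
// uploaded; otherwise, recurse into mismatching subtrees to find what changed.
console.log(hashTree("."));
```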

FACTS

  • In testing, GPT-4 was much better than earlier models.
  • Copilot initially launched only for VS Code.
  • 80% of Cursor users are on Windows.
  • Transformer models reuse the KV cache for efficiency.
  • Code in languages like JavaScript has lower loss (bits per byte) than natural language.
  • DeepMind models already solve complex math problems (e.g., the recent IMO results).
  • Modern databases support experimental branching.
  • Speculative execution in CPUs inspires techniques for AI models.
  • Poorly documented code is common even in products like Adobe Premiere.
  • SSMs (state space models) can handle long context more efficiently.

REFERENCES

  • VS Code - The editor Cursor is forked from
  • GitHub Copilot - The first mass-market AI coding assistant
  • OpenAI papers on scaling laws
  • GPT-4 and later OpenAI models
  • Llama (Meta) and other open-source models
  • TypeScript and Rust for static analysis
  • Turbopuffer and PlanetScale for databases
  • React for declarative interfaces
  • Jupyter Notebooks for scientific work
  • DeepSeek MLA (multi-head latent attention) for efficient attention

ONE-SENTENCE CONCLUSION

AI-assisted programming boosts productivity while preserving human creativity through intelligent interfaces.

RECOMMENDATIONS

  • Experiment with multiple models for different programming tasks.
  • Invest in smart caching to reduce latency in AI applications.
  • Clearly document dangerous code with explicit annotations.
  • Try database branching for experimental development.
  • Combine large and small models to balance efficiency and capability.
  • Prioritize speed in AI programming interfaces.
  • Use automated verification systems for critical code.
  • Incorporate human feedback to improve models via RLHF.
  • Explore multimodal interfaces beyond text alone.
  • Keep humans in control of important architectural decisions.

the following is a conversation with the<br>founding members of the cursor team<br>Michael truell swall oif Arvid lunark<br>and Aman Sanger cursor is a code editor<br>based on VSS code that adds a lot of<br>powerful features for AI assisted coding<br>it has captivated the attention and<br>excitement of the programming and AI<br>communities so I thought this is an<br>excellent opportunity to dive deep into<br>the role of AI in programming this is a<br>super technical conversation that is<br>bigger than just about one code editor<br>it's about the future of programming and<br>in general the future of human AI<br>collaboration in designing and<br>Engineering complicated and Powerful<br>systems this is Le Freedman podcast to<br>support it please check out our sponsors<br>in the description and now dear friends<br>here's Michael suale Arvid and<br>Aman all right this is awesome we have<br>Michael Aman suali Arvid here from the<br>cursor team first up big ridiculous<br>question what's the point of a code<br>editor so the the code editor is largely<br>the place where you build software and<br>today or for a long time that's meant<br>the place where you text edit uh a<br>formal programming language and for<br>people who aren't programmers the way to<br>think of a code editor is like a really<br>souped up word processor for programmers<br>where the reason it's it's souped up is<br>code has a lot of structure and so the<br>the quote unquote word processor the<br>code editor can actually do a lot for<br>you that word processors you know sort<br>of in the writing space haven't been<br>able to do for for people editing text<br>there and so you know that's everything<br>from giving you visual differentiation<br>of like the actual tokens in the code to<br>so you can like scan it quickly to<br>letting you navigate around the code<br>base sort of like you're navigating<br>around the internet with like hyperlinks<br>you're going to sort of definitions of<br>things you're using to error checking um<br>to you know to catch rudimentary B<br>um and so traditionally that's what a<br>code editor has meant and I think that<br>what a code editor is is going to change<br>a lot over the next 10 years um as what<br>it means to build software maybe starts<br>to look a bit different I I think also<br>code edor should just be fun yes that is<br>very important that is very important<br>and it's actually sort of an underated<br>aspect of how we decide what to build<br>like a lot of the things that we build<br>and then we we try them out we do an<br>experiment and then we actually throw<br>them out because they're not fun and and<br>so a big part of being fun is like being<br>fast a lot of the time fast is fun yeah<br>fast<br>is uh yeah that should be a<br>t-shirt like like<br>fundamentally I think one of the things<br>that draws a lot of people to to<br>building stuff on computers is this like<br>insane integration speed where you know<br>in other disciplines you might be sort<br>of gate capped by resources or or the<br>ability even the ability you know to get<br>a large group together and coding is<br>just like amazing thing where it's you<br>and the computer and uh that alone you<br>can you can build really cool stuff<br>really quickly so for people don't know<br>cursor is this super cool new editor<br>that's a fork of vs code it would be<br>interesting to get your kind of<br>explanation of your own journey of<br>editors how did you I think all of you<br>are were big fans of vs code with<br>co-pilot how did you arrive to 
VSS code<br>and how did that lead to your journey<br>with cursor yeah um<br>so I think a lot of us well all of us<br>originally Vim users pure pure VI pure<br>Vim yeah no neo just pure Vim in a<br>terminal and at Le at least for myself<br>it was around the time that C- pilot<br>came out so<br>2021 that I really wanted to try it so I<br>went into vs code the only platform the<br>only code editor in which it was<br>available<br>and even though I you know really<br>enjoyed using Vim just the experience of<br>co-pilot with with vs code was more than<br>good enough to convince me to switch and<br>so that kind of was the default until we<br>started working on cursor and uh maybe<br>we should explain what copala does it's<br>like a really nice<br>autocomplete it suggests as you start<br>writing a thing it suggests one or two<br>or three lines how to complete the thing<br>and there's a fun experience in that you<br>know like when you have a close<br>friendship and your friend completes<br>your<br>sentences like when it's done well<br>there's an intimate feeling uh there's<br>probably a better word than intimate but<br>there's a there's a cool feeling of like<br>holy it gets<br>me now and then there's an unpleasant<br>feeling when it doesn't get you uh and<br>so there's that that kind of friction<br>but I would say for a lot of people the<br>feeling that it gets me over powers that<br>it doesn't and I think actually one of<br>the underrated aspects of get up copet<br>is that even when it's wrong is it's<br>like a little bit annoying but it's not<br>that bad because you just type another<br>character and then maybe then it gets<br>you or you type another character and<br>then then it gets you so even when it's<br>wrong it's not that bad yeah you you can<br>sort of iterate iterate and fix it I<br>mean the other underrated part of uh<br>calot for me sort of was just the first<br>real real AI product it's like the first<br>language model consumer product so<br>copile was kind of like the first killer<br>app for LMS yeah and like the beta was<br>out in 2021 right okay mhm uh so what's<br>the the origin story of cursor so around<br>2020 the scaling loss papers came out<br>from from open Ai and that was a moment<br>where this looked like clear predictable<br>progress for the field where even if we<br>didn't have any more ideas looked like<br>you could make these models a lot better<br>if you had more computer and more data<br>uh by the way we'll probably talk uh for<br>three to four hours on on the topic of<br>scaling laws but just just to summarize<br>it's a paper and a set of papers and set<br>of ideas that say bigger might be better<br>for model size and data size in the in<br>the realm of machine learning it's<br>bigger and better but predictively<br>better okay this another topic of<br>conversation but anyway yeah so around<br>that time for some of us there were like<br>a lot of conceptual conversations about<br>what's this going to look like what's<br>the the story going to be for all these<br>different knowledge worker Fields about<br>how they're going to be um made better U<br>by this technology getting better and<br>then um I think there were a couple of<br>moments where like the theoretical gains<br>predicted in that paper uh started to<br>feel really concrete and it started to<br>feel like a moment where you could<br>actually go and not you know do a PhD if<br>you wanted to work on uh do useful work<br>in AI actually felt like now there was<br>this this whole set of systems one could<br>built 
that were really useful and I<br>think that the first moment we already<br>talked about a little bit which was<br>playing with the early bit of copell<br>like that was awesome and magical um I<br>think that the next big moment where<br>everything kind of clicked together was<br>actually getting early access to gbd4 so<br>sort of end of 2022 was when we were um<br>tinkering with that model and the Step<br>Up in capabilities felt enormous and<br>previous to that we had been working on<br>a couple of different projects we had<br>been um because of co-pilot because of<br>scaling laws because of our prior<br>interest in the technology we had been<br>uh tinkering around with tools for<br>programmers but things that are like<br>very specific so you know we were<br>building tools for uh Financial<br>professionals who have to work with in a<br>juper notebook or like you know playing<br>around with can you do static analysis<br>with these models and then the Step Up<br>in gbd4 felt like look that really made<br>concrete the theoretical gains that um<br>we had predicted before felt like you<br>could build a lot more just immediately<br>at that point in time and<br>also if we were being consistent it<br>really felt like um this wasn't just<br>going to be a point solution thing this<br>was going to be all of programming was<br>going to flow through these models it<br>felt like that demanded a different type<br>of programming environment to different<br>type of programming and so we set off to<br>build that that sort of larger Vision<br>around then there's one that I<br>distinctly remember so my roommate is an<br>IMO gold winner and uh there's a<br>competition in the US called of putam<br>which is sort of the IMO for college<br>people and it's it's this math<br>competition is he's exceptionally good<br>so Shang Tong and Aman I remember it<br>sort of June June of<br>2022 had this bet on whether the mo like<br>2024 June or July you were going to win<br>a gold medal in the Imo with the with<br>like models IMO is international math<br>Olympiad uh yeah IMO is international<br>math Olympiad and so Arvid and I are<br>both of you know also competed in it so<br>was sort of personal and uh and I I<br>remember thinking Matt is just this is<br>not going to happen this was like it un<br>like even though I I sort of believed in<br>progress I thought you know I'm a girl<br>just like Aman is just delusional that<br>was the that was the and and to be<br>honest I mean I I was to be clear it<br>very wrong but that was maybe the most<br>preent bet in the group so the the new<br>results from Deep Mind it turned out<br>that you were correct that's what well<br>it technically not technically incorrect<br>but one point awayan was very<br>enthusiastic about this stuff back then<br>and before Aman had this like scaling<br>loss t-shirt that he would walk around<br>with where it had like charts and like<br>the formulas on it oh so you like felt<br>the AI or you felt the scaling yeah I i<br>l remember there was this one<br>conversation uh I had with with Michael<br>where before I hadn't thought super<br>deeply and critically about scaling laws<br>and he kind of posed the question why<br>isn't scaling all you need or why isn't<br>scaling going to result in massive gains<br>in progress and I think I went through<br>like the like the stages of grief there<br>is anger denial and then finally at the<br>end just thinking about it uh acceptance<br>um and I think I've been quite hopeful<br>and uh optimistic about progress since I<br>think 
one thing I'll caveat is I think<br>it also depends on like which domains<br>you're going to see progress like math<br>is a great domain because especially<br>like formal theor improving because you<br>get this fantastic signal of actually<br>verifying if the thing was correct and<br>so this means something like RL can work<br>really really well and I think like you<br>could have systems that are perhaps very<br>superhuman in math and still not<br>technically have ai okay so can we take<br>it off all the way to cursor mhm and<br>what is cursor it's a fork of vs code<br>and VSS code is one of the most popular<br>editors for a long time like everybody<br>fell in love with it everybody left Vim<br>I left dmax for it<br>sorry<br>uh uh so it unified in some fun<br>fundamental way the uh the developer<br>community and then that you look at the<br>space of things you look at the scaling<br>laws AI is becoming amazing and you<br>decide decided okay it's not enough to<br>just write an extension Fe vs<br>code because there's a lot of<br>limitations to that we we need if AI is<br>going to keep getting better and better<br>and better we need to really like<br>rethink how the the AI is going to be<br>part of the editing process and so you<br>decided to Fork vs code and start to<br>build a lot of the amazing features<br>we'll be able to to to talk about but<br>what was that decision like because<br>there's a lot of extensions including<br>co-pilot of vs code that are doing so AI<br>type stuff what was the decision like to<br>just Fork vs code so the decision to do<br>an editor seemed kind of self-evident to<br>us for at least what we wanted to do and<br>Achieve because when we started working<br>on the editor the idea was these models<br>are going to get much better their<br>capabilities are going to improve and<br>it's going to entirely change how you<br>build software both in a you will have<br>big productivity gains but also radical<br>in how like the active building software<br>is going to change a lot and so you're<br>very limited in the control you have<br>over a code editor if you're a plugin to<br>an existing coding environment um and we<br>didn't want to get locked in by those<br>limitations we wanted to be able to um<br>just build the most useful stuff okay<br>well then the natural question<br>is you know VSS code is kind of with<br>copilot a competitor so how do you win<br>is is it basically just the speed and<br>the quality of the features yeah I mean<br>I think this is a space that is quite<br>interesting perhaps quite unique where<br>if you look at previous Tech waves<br>maybe there's kind of one major thing<br>that happened and unlocked a new wave of<br>companies but every single year every<br>single model capability uh or jump you<br>get model capabilities you now unlock<br>this new wave of features things that<br>are possible especially in programming<br>and so I think in AI programming being<br>even just a few months ahead let alone a<br>year ahead makes your product much much<br>much more useful I think the cursor a<br>year from now will need to make the<br>cursor of today look<br>Obsolete and I think you know Microsoft<br>has' done a number of like fantastic<br>things but I don't think they're in a<br>great place to really keep innovating<br>and pushing on this in the way that a<br>startup can just rapidly implementing<br>features and and push yeah like and and<br>kind of doing the research<br>experimentation<br>necessary um to really push the ceiling<br>I don't I don't know if I 
think of it in<br>terms of features as I think of it in<br>terms of like capabilities for for<br>programmers it's that like you know as<br>you know the new one model came out and<br>I'm sure there are going to be more more<br>models of different types like longer<br>context and maybe faster like there's<br>all these crazy ideas that you can try<br>and hopefully 10% of the crazy ideas<br>will make it into something kind of cool<br>and useful and uh we want people to have<br>that sooner to rephrase it's like an<br>underrated fact is we're making it for<br>oursel when we started cursor you really<br>felt this frustration that you know<br>models you could see models getting<br>better uh but the coall experience had<br>not changed it was like man these these<br>guys like the steing is getting higher<br>like why are they not making new things<br>like they should be making new things<br>they should be like you like like<br>where's where's where's all the alpha<br>features there there were no Alpha<br>features it was like uh I I'm sure it it<br>was selling well I'm sure it was a great<br>business but it didn't feel I I'm I'm<br>one of these people that really want to<br>try and use new things and was just<br>there's no new thing for like a very<br>long while yeah it's interesting uh I<br>don't know how you put that into words<br>but when you compare a cursor with<br>copilot copilot pretty quickly became<br>started to feel stale for some reason<br>yeah I think one thing that I think uh<br>helps us is that we're sort of doing it<br>all in one where we're developing the<br>the ux and the way you interact with the<br>model and at the same time as we're<br>developing like how we actually make the<br>model give better answers so like how<br>you build up the The Prompt or or like<br>how do you find the context and for a<br>cursor tab like how do you train the<br>model um so I think that helps us to<br>have all of it like sort of like the<br>same people working on the entire<br>experience on end yeah it's like the the<br>person making the UI and the person<br>training the model like sit to like 18<br>ft away so often the same person even<br>yeah often often even the same person so<br>you you can you create things that that<br>are sort of not possible if you're not<br>you're not talking you're not<br>experimenting and you're using like you<br>said cursor to write cursor of course oh<br>yeah yeah well let's talk about some of<br>these features let's talk about the all-<br>knowing the all powerful praise B to the<br>tab so the you know autocomplete on<br>steroids basically so what how does tab<br>work what is tab to highlight and<br>summarize it a high level I'd say that<br>there are two things that curser is<br>pretty good at right now there there are<br>other things that it does um but two<br>things it it helps programmers with one<br>is this idea of looking over your<br>shoulder and being like a really fast<br>colleague who can kind of jump ahead of<br>you and type and figure out what you're<br>what you're going to do next and that<br>was the original idea behind that was<br>kind kind of the kernel the idea behind<br>a good autocomplete was predicting what<br>you're going to do next you can make<br>that concept even more ambitious by not<br>just predicting the characters after<br>cursor but actually predicting the next<br>entire change you're going to make the<br>next diff the next place you're going to<br>jump to um and the second thing cursor<br>is is pretty good at right now too is<br>helping you 
sometimes jump ahead of the<br>AI and tell it what to do and go from<br>instructions to code and on both of<br>those we've done a lot of work on making<br>the editing experience for those things<br>ergonomic um and also making those<br>things smart and fast one of the things<br>we really wanted was we wanted the model<br>to be able to edit code for us uh that<br>was kind of a wish and we had multiple<br>attempts at it before before we had a<br>sort of a good model that could edit<br>code for<br>you U then after after we had a good<br>model I think there there have been a<br>lot of effort to you know make the<br>inference fast for you know uh having<br>having a good good<br>experience and uh we've been starting to<br>incorporate I mean Michael sort of<br>mentioned this like ability to jump to<br>different places and that jump to<br>different places I think came from a<br>feeling off you know once you once you<br>accept an edit um was like man it should<br>be just really obvious where to go next<br>it's like it's like I I made this change<br>the model should just know that like the<br>next place to go to is like 18 lines<br>down like uh if you're if you're a whim<br>user you could press 18 JJ or<br>whatever but like why why even why am I<br>doing this like the model the model<br>should just know it and then so so the<br>idea was you you just press tab it would<br>go 18 lines down and then make it would<br>show you show you the next edit and you<br>would press tab so it's just you as long<br>as you could keep pressing Tab and so<br>the internal competition was how many<br>tabs can we make them pressive once you<br>have like the idea uh more more uh sort<br>of abstractly the the thing to think<br>about is sort of like once how how how<br>are the edit sort of zero zero entropy<br>so once You' sort of expressed your<br>intent and the edit is there's no like<br>new bits of information to finish your<br>thought but you still have to type some<br>characters to like make the computer<br>understand what you're actually thinking<br>then maybe the model should just sort of<br>read your mind and and all the zero<br>entropy bits should just be like tabbed<br>away yeah that was that was sort of the<br>abstract there's this interesting thing<br>where if you look at language model loss<br>on on different domains um I believe the<br>bits per bite which is kind of character<br>normalized loss for code is lower than<br>language which means in general there<br>are a lot of tokens in code that are<br>super predictable lot of characters that<br>are super predictable um and this is I<br>think even magnified when you're not<br>just trying to autocomplete code but<br>predicting what the user is going to do<br>next in their editing of existing code<br>and so you know the gold cursor tab is<br>let's eliminate all the low entropy<br>actions you take inside of the editor<br>when the intent is effectively<br>determined let's just jump you forward<br>in time skip you forward well well<br>what's the intuition and what's the<br>technical details of how to do next<br>cursor prediction that jump that's not<br>that's not so intuitive I think to<br>people yeah I think I can speak to a few<br>of the details on how how to make these<br>things work they're incredibly low<br>latency so you need to train small<br>models on this on this task um in<br>particular they're incredibly pre-fill<br>token hungry what that means is they<br>have these really really long prompts<br>where they see a lot of your code and<br>they're not actually 
generating that<br>many tokens and so the perfect fit for<br>that is using a sparse model meaning Ane<br>model um so that was kind of one one<br>break one breakthrough we made that<br>substantially improved its performance<br>at longer context the other being um a<br>variant of speculative decoding that we<br>we kind of built out called speculative<br>edits um these are two I think important<br>pieces of what make it quite high<br>quality um and very fast okay soe<br>mixture of experts the input is huge the<br>output is small yeah okay so like what<br>what what else can you say about how to<br>make it like caching play a role in this<br>cashing plays a huge role M um because<br>you're dealing with this many input<br>tokens if every single keystroke that<br>you're typing in a given line you had to<br>rerun the model on all those tokens<br>passed in you're just going to one<br>significantly deg grade latency two<br>you're going to kill your gpus with load<br>so you need to you you need to design<br>the actual prompts use for the model<br>such that they're cach caching aware and<br>then yeah you need you need to re use<br>the KV cach across request just so that<br>you're spending less work less compute<br>uh again what are the things that tab is<br>supposed to be able to do kind of in the<br>near term just to like sort of Linger on<br>that generate code like fill empty<br>space Also edit code across multiple<br>lines yeah and then jump to different<br>locations inside the same file yeah and<br>then like hopefully jump to different<br>files also so if you make an edit in one<br>file and maybe maybe you have to go<br>maybe you have to go to another file to<br>finish your thought it should it should<br>go to the second file also yeah and then<br>the full generalization is like next<br>next action prediction like sometimes<br>you need to run a command in the<br>terminal and it should be able to<br>suggest the command based on the code<br>that you wrote too um or sometimes you<br>actually need to like it suggest<br>something but you you it's hard for you<br>to know if it's correct because you<br>actually need some more information to<br>learn like you need to know the type to<br>be able to verify that it's correct and<br>so maybe it should actually take you to<br>a place that's like the definition of<br>something and then take you back so that<br>you have all the requisite knowledge to<br>be able to accept the next completion Al<br>also providing the human the knowledge<br>yes right yeah can you integrate like I<br>just uh gotten to know a guy named Prime<br>Jen who I believe has an SS you can<br>order coffee via SSH<br>oh yeah oh we did that we did that uh so<br>can that also the model do that like<br>feed you and like yeah and provide you<br>with caffeine okay so that's the general<br>framework yeah and the the magic moment<br>would be<br>if it is programming is this weird<br>discipline where um sometimes the next<br>five minutes not always but sometimes<br>the next five minutes of what you're<br>going to do is actually predictable from<br>the stuff you've done recently and so<br>can you get to a world where that next 5<br>minutes either happens by you<br>disengaging and it taking you through or<br>maybe a little bit more of just you<br>seeing Next Step what it's going to do<br>and you're like okay that's good that's<br>good that's good that's good and you can<br>just sort of tap tap tap through these<br>big changes as we're talking about this<br>I should mention like one of the really<br>cool 
and noticeable things about cursor<br>is that there's this whole diff<br>interface situation going on so like the<br>model suggests with uh with the red and<br>the green of like here's how we're going<br>to modify the code and in the chat<br>window you can apply and it shows you<br>the diff and you can accept the diff so<br>maybe can you speak to whatever<br>direction of that we'll probably have<br>like four or five different kinds of<br>diffs uh so we we have optimized the<br>diff for for the autocomplete so that<br>has a different diff interface<br>than uh then when you're reviewing<br>larger blocks of code and then we're<br>trying to optimize uh another diff thing<br>for when you're doing multiple different<br>files uh and and sort of at a high level<br>the difference is for<br>when you're doing autocomplete it should<br>be really really fast to<br>read uh actually it should be really<br>fast to read in all situations but in<br>autocomplete it sort of you're you're<br>really like your eyes focused in one<br>area you you can't be in too many you<br>the humans can't look in too many<br>different places so you're talking about<br>on the interface side like on the<br>interface side so it currently has this<br>box on the side so we have the current<br>box and if it tries to delete code in<br>some place and tries to add other code<br>it tries to show you a box on the you<br>can maybe show it if we pull it up on<br>cursor. comom this is what we're talking<br>about so that it was like three or four<br>different attempts at trying to make<br>this this thing work where first the<br>attempt was like these blue crossed out<br>line so before it was a box on the side<br>it used to show you the code to delete<br>by showing you like uh like Google doc<br>style you would see like a line through<br>it then you would see the the new code<br>that was super distracting and then we<br>tried many different you know there was<br>there was sort of deletions there was<br>trying to Red highlight then the next<br>iteration of it which is sort of funny<br>Would you would hold the on Mac the<br>option button so it would it would sort<br>of highlight a region of code to show<br>you that there might be something coming<br>uh so maybe in this example like the<br>input and the value uh would get would<br>all get blue and the blue would to<br>highlight that the AI had a suggestion<br>for you uh so instead of directly<br>showing you the thing it would show you<br>that the AI it would just hint that the<br>AI had a suggestion and if you really<br>wanted to see it you would hold the<br>option button and then you would see the<br>new suggestion then if you release the<br>option button you would then see your<br>original code mhm so that's by the way<br>that's pretty nice but you have to know<br>to hold the option button yeah I by the<br>way I'm not a Mac User but I got it it<br>was it was it's a button I guess you<br>people<br>it's h you know it's again it's just<br>it's just nonintuitive I think that's<br>the that's the key thing and there's a<br>chance this this is also not the final<br>version of it I am personally very<br>excited for<br>um making a lot of improvements in this<br>area like uh we we often talk about it<br>as the verification problem where U<br>these diffs are great for small edits uh<br>for large edits or like when it's<br>multiple files or something it's um<br>actually<br>a little bit prohibitive to to review<br>these diffs and uh uh so there are like<br>a couple of different ideas here like<br>one idea 
that we have is okay you know<br>like parts of the diffs are important<br>they have a lot of information and then<br>parts of the diff um are just very low<br>entropy they're like exam like the same<br>thing over and over again and so maybe<br>you can highlight the important pieces<br>and then gray out the the not so<br>important pieces or maybe you can have a<br>model that uh looks at the the diff and<br>and sees oh there's a likely bug here I<br>will like Mark this with a little red<br>squiggly and say like you should<br>probably like review this part of the<br>diff um and ideas in in that vein I<br>think are exciting yeah that's a really<br>fascinating space of like ux design<br>engineering so you're basically trying<br>to guide the human programmer through<br>all the things they need to read and<br>nothing more yeah like optimally yeah<br>and you want an intelligent model to do<br>it like ly diffs Al diff algorithms are<br>they're like Al like they're just like<br>normal algorithms uh there's no<br>intelligence uh there's like<br>intelligence that went into designing<br>the algorithm but then there there's no<br>like you don't care if the if it's about<br>this thing or this thing uh and so you<br>want a model to to do this so I think<br>the the the general question is like M<br>these models are going to get much<br>smarter as the models get much smarter<br>uh the the changes they will be able to<br>propose are much bigger so as the<br>changes gets bigger and bigger and<br>bigger the humans have to do more and<br>more and more verification work it gets<br>more and more more hard like it's just<br>you need you need to help them out it<br>sort of I I don't want to spend all my<br>time reviewing<br>code uh can you say a little more across<br>multiple files div yeah I mean so GitHub<br>tries to solve this right with code<br>review when you're doing code review<br>you're reviewing multiple deaths cross<br>multiple files but like Arvid said<br>earlier I think you can do much better<br>than code review you know code review<br>kind of sucks like you spend a lot of<br>time trying to grock this code that's<br>often quite unfamiliar to you and it<br>often like doesn't even actually catch<br>that many bugs and I think you can<br>signific significantly improve that<br>review experience using language models<br>for example using the kinds of tricks<br>that AR had described of maybe uh<br>pointing you towards the regions that<br>matter<br>um I think also if the code is produced<br>by these language models uh and it's not<br>produced by someone else like the code<br>review experience is designed for both<br>the reviewer and the person that<br>produced the code in the case where the<br>person that produced the code is a<br>language model you don't have to care<br>that much about their experience and you<br>can design the entire thing around the<br>reviewer such that the reviewer's job is<br>as fun as easy as productive as possible<br>um and I think that that feels like the<br>issue with just kind of naively trying<br>to make these things look like code<br>review I think you can be a lot more<br>creative and and push the boundary and<br>what's possible just one one idea there<br>is I think ordering matters generally<br>when you review a PR you you have this<br>list of files and you're reviewing them<br>from top to bottom but actually like you<br>actually want to understand this part<br>first because that came like logically<br>first and then you want understand the<br>next part and um you don't want to 
have<br>to figure out that yourself you want a<br>model to guide you through the thing and<br>is the step of creation going to be more<br>and more natural language is the goal<br>versus with actual uh I think sometimes<br>I don't think it's going to be the case<br>that all of programming will be natural<br>language and the reason for that is you<br>know if I'm PR programming with swalla<br>and swall is at the computer and the<br>keyboard uh and sometimes if I'm like<br>driving I want to say to swallet hey<br>like implement this function and that<br>that works and then sometimes it's just<br>so annoying to explain to swalla what I<br>want him to do and so I actually take<br>over the keyboard and I show him I I<br>write like part of the example and then<br>it makes sense and that's the easiest<br>way to communicate and so I think that's<br>also the case for AI like sometimes the<br>easiest way to communicate with the AI<br>will be to show an example and then it<br>goes and does the thing everywhere else<br>or sometimes if you're making a website<br>for example the easiest way to show to<br>the a what you want is not to tell it<br>what to do but you know drag things<br>around or draw things um and yeah and<br>and like maybe eventually we will get to<br>like brain machine interfaces or<br>whatever and can of like understand what<br>you're thinking and so I think natural<br>language will have a place I think it<br>will not definitely not be the way most<br>people program most of the time I'm<br>really feeling the AGI with this editor<br>uh it feels like there's a lot of<br>machine learning going on underneath<br>tell tell me about some of the ml stuff<br>that makes it all work recursor really<br>works via this Ensemble of custom models<br>that that that we've trained alongside<br>you know the frontier models that are<br>fantastic at the reasoning intense<br>things and so cursor tab for example is<br>is a great example of where you can<br>specialize this model to be even better<br>than even Frontier models if you look at<br>evls on on the on the task we set it at<br>the other domain which it's kind of<br>surprising that it requires custom<br>models but but it's kind of necessary<br>and works quite well is in apply<br>um<br>so I think these models are like the<br>frontier models are quite good at<br>sketching out plans for code and<br>generating like rough sketches of like<br>the change but<br>actually creating diffs is quite hard um<br>for Frontier models for your training<br>models um like you try to do this with<br>Sonet with 01 any Frontier Model and it<br>it really messes up stupid things like<br>counting line numbers um especially in<br>super super large file<br>um and so what we've done to alleviate<br>this is we let the model kind of sketch<br>out this rough code block that indicates<br>what the change will be and we train a<br>model to then apply that change to the<br>file and we should say that apply is the<br>model looks at your code it gives you a<br>really damn good suggestion of what new<br>things to do and the seemingly for<br>humans trivial step of combining the two<br>you're saying is not so trivial contrary<br>to popular perception it is not a<br>deterministic algorithm yeah I I I think<br>like you see shallow copies of apply um<br>elsewhere and it just breaks like most<br>of the time because you think you can<br>kind of try to do some deterministic<br>matching and then it fails you know at<br>least 40% of the time and that just<br>results in a terrible product<br>experience um 
I think in general this<br>this regime of you are going to get<br>smarter models and like so one other<br>thing that apply lets you do is it lets<br>you use fewer tokens with the most<br>intelligent models uh this is both<br>expensive in terms of latency for<br>generating all these tokens um and cost<br>so you can give this very very rough<br>sketch and then have your smaller models<br>go and implement it because it's a much<br>easier task to implement this very very<br>sketched out code and I think that this<br>this regime will continue where you can<br>use smarter and SM models to do the<br>planning and then maybe the<br>implementation details uh can be handled<br>by the less intelligent ones perhaps<br>you'll have you know maybe 01 maybe<br>it'll be even more cap capable models<br>given an even higher level plan that is<br>kind of recursively uh applied by Sonet<br>and then the apply model maybe we should<br>we should talk about how to how to make<br>it fast yeah I feel like fast is always<br>an interesting detail fast good yeah how<br>do you make it fast yeah so one big<br>component of making it it fast is<br>speculative edits so speculative edits<br>are a variant of speculative decoding<br>and maybe be helpful to briefly describe<br>speculative decoding um with speculative<br>decoding what you do is you you can kind<br>of take advantage of the fact that you<br>know most of the time and I I'll add the<br>caveat that it would be when you're<br>memory Bound in in language model<br>Generation Um if you process multiple<br>tokens at once um it is faster than<br>generating one Tok at a time so this is<br>like the same reason why if you look at<br>tokens per second uh with prompt tokens<br>versus generated tokens it's much much<br>faster for prompt tokens um so what we<br>do is instead of using what specul<br>decoding normally does which is using a<br>really small model to predict these<br>draft tokens that your larger model<br>would then go in and and verify um with<br>code edits we have a very strong prior<br>of what the existing code will look like<br>and that prior is literally the same<br>exact code so you can do is you can just<br>feed chunks of the original code back<br>into the into the model um and then the<br>model will just pretty much agree most<br>of the time that okay I'm just going to<br>spit this code back out and so you can<br>process all of those lines in parallel<br>and you just do this with sufficiently<br>many chunks and then eventually you'll<br>reach a point of disagreement where the<br>model will now predict text that is<br>different from the ground truth original<br>code it'll generate those tokens and<br>then we kind of will decide after enough<br>tokens match<br>uh the original code to re start<br>speculating in chunks of code what this<br>actually ends up looking like is just a<br>much faster version of normal editing<br>code so it's just like it looks like a<br>much faster version of the model<br>rewriting all the code so just we we can<br>use the same exact interface that we use<br>for for diffs but it will just stream<br>down a lot faster and then and then the<br>advantage is that W wireless streaming<br>you can just also be reviewing start<br>reviewing the code exactly before before<br>it's done so there's no no big loading<br>screen uh so maybe that that is part of<br>the part of the advantage so the human<br>can start reading before the thing is<br>done I think the interesting riff here<br>is something like like speculation is a<br>fairly common idea 
nowadays it's like<br>not only in language models I mean<br>there's obviously speculation in CPUs<br>and there's there like speculation for<br>databases and like speculation all over<br>the place let me ask the sort of the<br>ridiculous question of uh which llm is<br>better at coding GPT Claude who wins in<br>the context of programming and I'm sure<br>the answer is much more Nuance because<br>it sounds like every single part of this<br>involves a different<br>model yeah I think they there's no model<br>that poo dominates uh others meaning it<br>is better in all categories that we<br>think matter the categories being<br>speed<br>um ability to edit code ability to<br>process lots of code long context you<br>know a couple of other things and kind<br>of coding<br>capabilities the one that I'd say right<br>now is just kind of net best is Sonet I<br>think this is a consensus opinion our<br>one's really interesting and it's really<br>good at reasoning so if you give it<br>really hard uh programming interview<br>style problems or lead code problems it<br>can do quite quite well on them um but<br>it doesn't feel like it kind of<br>understands your rough intent as well as<br>son it<br>does like if you look at a lot of the<br>other Frontier models um one qual I have<br>is it feels like they're not necessarily<br>over I'm not saying they they train in<br>benchmarks um but they perform really<br>well in benchmarks relative to kind of<br>everything that's kind of in the middle<br>so if you tried on all these benchmarks<br>and things that are in the distribution<br>of the benchmarks they're valuated on<br>you know they'll do really well but when<br>you push them a little bit outside of<br>that son's I think the one that that<br>kind of does best at at kind of<br>maintaining that same capability like<br>you kind of have the same capability in<br>The Benchmark as when you try to<br>instruct it to do anything with coding<br>what another ridiculous question is the<br>difference between the normal<br>programming experience versus what<br>benchmarks represent like where do<br>benchmarks fall short do you think when<br>we're evaluating these models by the way<br>that's like a really really hard it's<br>like like critically important detail<br>like how how different like benchmarks<br>are versus where is like real coding<br>where real<br>coding it's not interview style coding<br>it's you're you're doing these you know<br>humans are saying like half broken<br>English sometimes and sometimes you're<br>saying like oh do what I did<br>before sometimes you're saying uh you<br>know go add this thing and then do this<br>other thing for me and then make this UI<br>element and then you know it's it's just<br>like a lot of things are sort of context<br>dependent<br>you really want to like understand the<br>human and then do do what the human<br>wants as opposed to sort of this maybe<br>the the way to put it is sort of<br>abstractly is uh the interview problems<br>are<br>very wellp<br>specified they lean a lot on<br>specification while the human stuff is<br>less<br>specified yeah I think that this this SP<br>for question is both Complicated by what<br>um Sol just mentioned and then also to<br>what Aman was getting into is that even<br>if you like you know there's this<br>problem of like the skew between what<br>can you actually model in a benchmark<br>versus uh real programming and that can<br>be sometimes hard to encapsulate because<br>it's like real programming is like very<br>messy and sometimes things aren't 
super<br>well specified what's correct or what<br>isn't but then uh it's also doubly hard<br>because of this public Benchmark problem<br>and that's both because public<br>benchmarks are sometimes kind of Hill<br>climbed on then it's like really really<br>hard to also get the data from the<br>public benchmarks out of the models and<br>so for instance like one of the most<br>popular like agent benchmarks sweet<br>bench um is really really contaminated<br>in the training data of uh these<br>Foundation models and so if you ask<br>these Foundation models to do a sweet<br>bench problem you actually don't give<br>them the context of a codebase they can<br>like hallucinate the right file pass<br>they can hallucinate the right function<br>names um and so the the it's it's also<br>just the public aspect of these things<br>is tricky yeah like in that case it<br>could be trained on the literal issues<br>or pool request themselves and and maybe<br>the lives will start to do a better job<br>um or they've already done a good job at<br>decontaminating those things but they're<br>not going to emit the actual training<br>data of the repository itself like these<br>are all like some of the most popular<br>python repositories like simpai is one<br>example I don't think they're going to<br>handicap their models on Senpai and all<br>these popular P python repositories in<br>order to get uh true evaluation scores<br>in these benchmarks yeah I think that<br>given the dirs and benchmarks<br>um there have been like a few<br>interesting crutches that uh places that<br>build systems with these models or build<br>these models actually use to get a sense<br>of are they going in the right direction<br>or not and uh in a lot of places uh<br>people will actually just have humans<br>play with the things and give<br>qualitative feedback on these um like<br>one or two of the foundation model<br>companies they they have people who<br>that's that's a big part of their role<br>and you know internally we also uh you<br>know qualitatively assess these models<br>and actually lean on that a lot in<br>addition to like private evals that we<br>have it's like the live<br>the vibe yeah the vi the vibe Benchmark<br>human Benchmark the hum you pull in the<br>humans to do a Vibe check yeah okay I<br>mean that's that's kind of what I do<br>like just like reading online forums and<br>Reddit and X just like well I don't know<br>how<br>to properly load in people's opinions<br>because they'll say things like I feel<br>like Claude or gpt's gotten Dumber or<br>something they'll say I feel like<br>and then I sometimes feel like that too<br>but I wonder if it's the model's problem<br>or mine yeah with Claude there's an<br>interesting take I heard where I think<br>AWS has different chips um and I I<br>suspect they've slightly different<br>numerics than uh Nvidia gpus and someone<br>speculated that claud's deg degraded<br>performance had to do with maybe using<br>the quantise version that existed on AWS<br>Bedrock versus uh whatever was running<br>on on anthropics gpus I interview a<br>bunch of people that have conspiracy<br>theories so I'm glad spoke spoke to this<br>conspiracy well it's it's not not like<br>conspiracy theory as much as they're<br>just they're like they're you know<br>humans humans are humans and there's<br>there's these details and you know<br>you're<br>doing like these quzy amount of flops<br>and you know chips are messy and man you<br>can just have bugs like bugs are it's<br>it's hard to overstate how how hard bugs<br>are to 
avoid what's uh the role of a<br>good prompt in all this see you mention<br>that benchmarks have<br>really uh structured well formulated<br>prompts what what should a human be<br>doing to maximize success and what's the<br>importance of what the humans you wrote<br>a blog post on you called it prompt<br>design yeah uh I think it depends on<br>which model you're using and all of them<br>are likly different and they respond<br>differently to different prompts but um<br>I think the original gp4 uh and the<br>original sort of bre of models last last<br>year they were quite sensitive to the<br>prompts and they also had a very small<br>context window and so we have all of<br>these pieces of information around the<br>codebase that would maybe be relevant in<br>the prompt like you have the docs you<br>have the files that you add you have the<br>conversation history and then there's a<br>problem like how do you decide what you<br>actually put in the prompt and when you<br>have a a limited space and even for<br>today's models even when you have long<br>context filling out the entire context<br>window means that it's slower it means<br>that sometimes a model actually gets<br>confused and some models get more<br>confused than others and we have this<br>one system internally that we call preum<br>which helps us with that a little bit um<br>and I think it was built for the era<br>before where we had<br>8,000 uh token context Windows uh and<br>it's a little bit similar to when you're<br>making a website you you sort of you you<br>want it to work on mobile you want it to<br>work on a desktop screen and you have<br>this uh Dynamic information which you<br>don't have for example if you're making<br>like designing a print magazine you have<br>like you know exactly where you can put<br>stuff but when you have a website or<br>when you have a prompt you have these<br>inputs and then you need to format them<br>will always work even if the input is<br>really big then you might have to cut<br>something down uh and and and so the<br>idea was okay like let's take some<br>inspiration what's the best way to<br>design websites well um the thing that<br>we really like is is react and the<br>declarative approach where you um you<br>use jsx in in in JavaScript uh and then<br>you declare this is what I want and I<br>think this has higher priority or like<br>this has higher Z index than something<br>else um and<br>then you have this rendering engine in<br>web design it's it's like Chrome and uh<br>in our case it's a pre renderer uh which<br>then fits everything onto the page and<br>and so you declaratively decide what you<br>want and then it figures out what you<br>want um and and so we have found that to<br>be uh quite helpful and I think the role<br>of it has has sort of shifted over time<br>um where initially was to fit to these<br>small context Windows now it's really<br>useful because you know it helps us with<br>splitting up the data that goes into the<br>prompt and the actual rendering of it<br>and so um it's easier to debug because<br>you can change the rendering of the<br>prompt and then try it on Old prompts<br>because you have the raw data that went<br>into the prompt and then you can see did<br>my change actually improve it for for<br>like this entire evil set so do you<br>literally prompt with jsx yes yes so it<br>kind of looks like react there are<br>components like we have one component<br>that's a file component and it takes in<br>like the cursor<br>like usually there's like one line where<br>the cursor is 
So do you literally prompt with JSX? Yes. It kind of looks like React: there are components. We have one component that's a file component, and it takes in the cursor; usually there's one line where the cursor is in your file, and that's probably the most important line, because that's the one you're looking at. So you can give priorities: that line has the highest priority, and you subtract one for every line that's farther away, and eventually, when it's rendered, it figures out how many lines it can actually fit and centers around that thing.

That's amazing. Yeah, and you can do other fancy things: if you have lots of code blocks from the entire codebase, you could use retrieval and things like embedding and reranking scores to add priorities for each of these components.

So should humans, when they ask questions, also try to use something like that? Would it be beneficial to write JSX in the prompt, or is the whole idea that it should be loose and messy? I think our goal is that you should just do whatever is the most natural thing for you, and then our job is to figure out how to retrieve the relevant things so that your request actually makes sense.

Well, this is sort of the discussion I had with Aravind of Perplexity: his whole idea is that you should let the person be as lazy as they want. That's a beautiful thing, but I feel like you're allowed to ask more of programmers, right? If you say "just do what you want," well, humans are lazy. There's a kind of tension between just being lazy versus providing more, with the system almost pressuring you, or inspiring you, to be articulate; not in terms of the grammar of the sentences, but in terms of the depth of the thoughts that you convey in the prompts.

I think even as the system gets closer to some level of perfection, often when you ask the model for something, not enough intent is conveyed to know what to do. There are a few ways to resolve that intent. One is the simple thing of having the model just ask you: "I'm not sure how to do these parts based on your query; could you clarify?" The other could be: if there are five or six possible generations given the uncertainty present in your query so far, why not just show you all of those and let you pick?

How hard is it for the model to choose to talk back versus generate? How do you deal with the uncertainty: do I choose to ask for more information to reduce the ambiguity? One of the things we do, a recent addition, is try to suggest files that you can add. While you're typing, one can guess what the uncertainty is and maybe suggest something. Maybe you're writing your API, and we can guess, using the commits you've made previously in the same file, that the client and the server files would be super useful. There's a hard technical problem of how you resolve, across all commits, which files are the most important given your current prompt. The initial version is rolled out, and I'm sure we can make it much more accurate; it's very experimental. But the idea is we show you: do you just want to add this file, this file, and this file, and also tell the model to edit those files for you?
Because if you're making the API, you should probably also edit the client and the server that are using and resolving that API. That would be kind of cool, both in the phase where you're writing the prompt and before you even click enter; maybe we can help resolve some of the uncertainty.

To what degree do you use agentic approaches? How useful are agents? We think agents are really cool. Agents resemble a human in a way; you can kind of feel that you're getting closer to AGI, because you see a demo where it acts as a human would, and it's really cool. I think agents are not yet super useful for many things, though I think we're getting close to where they will actually be useful. There are certain types of tasks where having an agent would be really nice. For example, I would love to have an agent for a bug where you sometimes can't command-C and command-V inside our chat input box. That's a task that's super well specified: I just want to say, in two sentences, "this does not work, please fix it," and then I would love an agent that just goes off and does it, and a day later I come back and review the thing.

You mean it goes and finds the right file? Yeah, it finds the right files, it tries to reproduce the bug, it fixes the bug, and then it verifies that it's correct. This could be a process that takes a long time, and I would love to have that. But then, for a lot of programming, there's often this belief that agents will take over all of programming. We don't think that's the case, because a lot of the value in programming is in iterating: you don't actually want to specify something upfront, because you don't really know what you want until you've seen an initial version, and then you want to iterate on that and provide more information. So for a lot of programming, I think you actually want a system that's instant, that gives you an initial version back instantly, and then lets you iterate super quickly.

What about something like the recently released Replit Agent, which also does things like setting up the development environment, installing software packages, configuring everything including the databases, and actually deploying the app? Is that also in the set of things you dream about? I think so; that would be really cool for certain types of programming. Is that within the scope of Cursor? We aren't actively working on it right now, but it's definitely in the spirit of what we want: to make the programmer's life easier and more fun. Some things are just really tedious, and you need to go through a bunch of steps, and you want to delegate that to an agent. And some things you can have an agent do in the background while you're working. Say you have a PR that's both backend and frontend; you're working on the frontend, and you can have a background agent that does some work, figures out what you're doing, and then, when you get to the backend part of your PR, you already have some initial piece of code that you can iterate on.
That would also be really cool. One of the things we already talked about is speed, but I wonder if we can linger on that some more: the various technical details involved in making this thing really fast. Most aspects of Cursor feel really fast; like I mentioned, apply is probably the slowest thing, and for me, sorry, the pain. I know, it's a pain that we're feeling, and we're working on fixing it. It says something that one or two seconds feels slow; that actually shows that everything else is just really fast. So are there technical details about how you make the chat fast, how you make the diffs fast? Is there something that jumps to mind?

Yeah, we can go over a lot of the strategies we use. One interesting thing is cache warming. As the user is typing, you know you're probably going to use some piece of context, and you can know that before the user is done typing. As we discussed before, reusing the KV cache results in lower latency and lower cost across requests. So as the user starts typing, you can immediately warm the cache with, say, the current file contents, and then when they press enter, there are very few tokens it actually has to prefill and compute before starting the generation. This significantly lowers the time to first token.

Can you explain how the KV cache works? Yeah. In the way Transformers work, the mechanism that allows them to not just look at each token independently, but to see previous tokens, is the keys and values in attention. Generally, the way attention works is: at your current token you have some query, and then you have the keys and values of all your previous tokens, which are a representation the model stores internally of all the previous tokens in the prompt. By default, when you're doing a chat, the model would have to do a forward pass through the entire model for every single token; that's a lot of matrix multiplies, and it's really slow. Instead, if you have already done that and stored the keys and values, and you keep them in the GPU, then, say I've stored them for the last n tokens: if I now want to compute the output for the (n+1)-th token, I don't need to pass those first n tokens through the entire model, because I already have all their keys and values. I just do the forward pass for that last token, and when doing attention I reuse the keys and values that have already been computed, which is the only sequentially dependent part of the Transformer.
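A toy TypeScript sketch of the idea. The "model" and "cache" here are trivial stand-ins, not a real inference stack; the point is only that prefill happens once and each decode step touches only the newest token:

```typescript
// Toy stand-ins: a real KV cache holds per-layer key/value tensors; here it
// is just the list of tokens whose keys/values have already been "computed".
type KVCache = number[];

// Full forward pass over the prompt (done once). Cache warming, as described
// above, means starting this while the user is still typing.
function prefill(tokens: number[]): KVCache {
  return [...tokens];
}

// One decode step: only the newest token does a forward pass; everything
// before it is read from the cache instead of being recomputed.
function decodeOneToken(cache: KVCache, lastToken: number): { next: number; cache: KVCache } {
  const next = (lastToken + cache.length) % 50000; // dummy "model" output
  return { next, cache: [...cache, next] };
}

function generate(promptTokens: number[], maxNewTokens: number): number[] {
  let cache = prefill(promptTokens);
  const out: number[] = [];
  let last = promptTokens[promptTokens.length - 1];
  for (let i = 0; i < maxNewTokens; i++) {
    const step = decodeOneToken(cache, last);
    out.push(step.next);
    cache = step.cache;
    last = step.next;
  }
  return out;
}
```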
Is there a higher level of caching, like caching of the prompts, that could help? Yeah, there are other types of caching you can do. One interesting thing you can do for Cursor Tab is to basically predict ahead, as if the user had accepted the suggestion, and then trigger another request. So it's a mix of speculation and caching: you're speculating about what would happen if they accepted it, and then you have that suggestion cached, and when they press tab, the next one is waiting for them immediately. It's a kind of clever heuristic-slash-trick that uses higher-level caching, and it feels fast despite there not actually being any changes in the model.

And if you can make the KV cache smaller, one of the advantages is that maybe you can speculate even more: maybe you can predict the next 10 things that could be useful. Then it's possible the user hits one of the 10, which is a much higher chance than the user hitting the exact one you showed them; maybe they type another character and hit something else in the cache. There's a general phenomenon here, which is also super useful for RL: maybe a single sample from the model isn't very good, but if you predict 10 different things, it turns out the probability that one of the 10 is right is much higher. There are these pass@k curves, and part of what RL does is let you exploit this pass@k phenomenon to make many different predictions. One way to think about it: the model internally has some uncertainty over which of the k things is correct, or which of the k things the human wants. When we RL our Cursor Tab model, one of the things we're doing is predicting which of the hundred different suggestions the model produces is more amenable to humans: which of them do humans like more than others? Maybe the model can predict very far ahead versus a little bit, or somewhere in the middle, and you give a reward to the things humans would like more and punish the things they wouldn't, and then train the model to output the suggestions humans would like more. You have these RL loops that are very useful, that exploit these pass@k curves.
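For reference, the pass@k quantity being appealed to here is usually estimated with the standard unbiased estimator from the code-generation evaluation literature: generate $n \ge k$ samples per problem, count the $c$ correct ones, and average

$$\text{pass@}k \;=\; \mathbb{E}_{\text{problems}}\!\left[\,1 - \frac{\binom{n-c}{k}}{\binom{n}{k}}\,\right],$$

i.e. the probability that at least one of $k$ samples is correct, which is exactly the "one of the 10 is right" effect described above.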
Aman, maybe you can go into even more detail. Yeah, it's a little different from speed, but technically it ties back in, because you can get away with a smaller model if you RL your smaller model and it gets the same performance as the bigger one. Sualeh was mentioning reducing the size of your KV cache; there are other techniques there as well that are really helpful for speed. Back in the day, all the way two years ago, people mainly used multi-head attention, and there's been a migration toward more efficient attention schemes, like grouped-query or multi-query attention. This is really helpful, with larger batch sizes, for generating tokens much faster. The interesting thing is that this has no effect on the time-to-first-token prefill speed; what it matters for is generating tokens. Why? Because when you're generating tokens, instead of being bottlenecked by the highly parallelizable matrix multiplies across all your tokens, you're bottlenecked, for long context with large batch sizes, by how quickly you can read those cached keys and values. That's memory bandwidth, so how can we make it faster? We can try to compress the size of these keys and values.

Multi-query attention is the most aggressive of these. Normally, with multi-head attention, you have some number of attention heads: key/value heads and a matching number of query heads. Multi-query preserves the query heads and gets rid of all the key/value heads, so there's only one key/value head alongside all the remaining query heads. With grouped-query, you instead preserve all the query heads and have fewer key/value heads, but you're not reducing it to just one. Either way, the whole point is that you're reducing the size of your KV cache.

And then there is MLA, multi-head latent attention. That's a little more complicated: it turns the entirety of your keys and values, across all your heads, into one latent vector that is then expanded at inference time. MLA is from this company called DeepSeek; it's quite an interesting algorithm. Maybe the key idea is: in MQA and elsewhere, what you're doing is reducing the number of KV heads. The advantage is that there are fewer of them, but maybe the theory is that you actually want each of the keys and values to be different. So one way to reduce the size is to keep one big shared vector for all the keys and values, and then smaller vectors for every single token, so you store only the smaller thing, as a sort of low-rank reduction. At the end, when you eventually want to compute the final thing, remember that you're memory-bound, which means you still have some compute left over, so you can expand the latent vector back out. This is far more efficient because you're reducing the size of the vector that you're keeping by, say, 32x.

Yeah, there's perhaps some richness in having a separate set of keys, values, and queries that pairwise match up, versus compressing all of that into one. Okay, and all of that is about being memory-bound. Ultimately, how does that map to the user experience? The two things it maps to are: you can now make your cache a lot larger, because you have less space allocated for the KV cache, so you can cache a lot more aggressively and cache a lot more things, which means more cache hits, which help reduce the time to first token for the reasons described earlier.
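As a rough back-of-the-envelope (standard sizing arithmetic, not a figure from the conversation), the KV cache of a decoder-only Transformer takes about

$$\text{bytes} \;\approx\; 2 \cdot n_{\text{layers}} \cdot n_{\text{kv\text{-}heads}} \cdot d_{\text{head}} \cdot L \cdot B \cdot b,$$

where $L$ is sequence length, $B$ is batch size, $b$ is bytes per element, and the leading 2 counts keys plus values. MQA and GQA shrink $n_{\text{kv-heads}}$, and MLA effectively replaces the $n_{\text{kv-heads}} \cdot d_{\text{head}}$ term with a much smaller latent dimension, which is why these schemes free up memory for bigger caches and batches.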
And the second thing is that when you start doing inference with more and more requests and larger and larger batch sizes, you don't see much of a slowdown in the speed at which tokens are generated. Does it also allow you to make your prompt bigger? Yeah, the size of your KV cache is the size of all your prompts multiplied by the number of prompts being processed in parallel, so you can increase either of those dimensions, the batch size or the size of your prompts, without degrading the latency of generating tokens.

Arvid, you wrote a blog post, "Shadow Workspace: Iterating on Code in the Background." What's going on there? To be clear, we want there to be a lot of stuff happening in the background, and we're experimenting with a lot of things. Right now we don't have much of that happening, other than the cache warming or figuring out the right context that goes into your Command-K prompts, for example. But the idea is: if you can actually spend computation in the background, then you can help the user at a slightly longer time horizon than just predicting the next few lines you're going to write; actually, what are you going to write in the next 10 minutes? By doing it in the background, you can spend more computation on it.

The idea of the shadow workspace that we implemented, and that we use internally for experiments, is that to actually get an advantage from doing stuff in the background, you want some kind of feedback signal to give back to the model, because otherwise you can get higher performance just by letting the model think for longer; o1 is a good example of that. Another way to improve performance is to let the model iterate and get feedback. One very important piece of feedback, when you're a programmer, is the language server. This is a thing that exists for most languages, with a separate language server per language. It can tell you "you're using the wrong type here" and give you an error, or it can let you go to definition; it sort of understands the structure of your code. Language servers are extensions developed by, for instance, the TypeScript people for TypeScript, the Rust people for Rust, and they all interface over the Language Server Protocol with VS Code, so that VS Code doesn't need to have all the different languages built in; instead you can use the existing compiler infrastructure.

For linting purposes? It's for linting, for going to definition, and for seeing the right types that you're using. So it's doing type checking too? Yes, type checking and going to references. When you're working in a big project, you kind of need that; without it, it's really hard to code in a big project.

Can you say again how that's being used inside Cursor, this Language Server Protocol communication? It's being used in Cursor to show things to the programmer, just as in VS Code.
But then the idea is that you want to show that same information to the AI models, and you want to do that in a way that doesn't affect the user, because you want to do it in the background. So the idea behind the shadow workspace was: one way we can do this is to spawn a separate window of Cursor that's hidden. You can set a flag in Electron so it's hidden; there is a window, but you don't actually see it. Inside this window, the AI agents can modify code however they want, as long as they don't save it, because it's still the same folder, and then they can get feedback from the linters, go to definition, and iterate on their code.

So literally run everything in the background, maybe even run the code? That's the eventual version; that's what you want. A lot of the blog post is about how you make that happen, because it's a little bit tricky. You want it to be on the user's machine, so that it exactly mirrors the user's environment. On Linux, you can do this cool thing where you can actually mirror the file system and have the AI make changes to the files; it thinks it's operating at the file level, but actually that's stored in memory, and you can create a kernel-level extension to make it work. On Mac and Windows it's a little more difficult, but it's a fun technical problem.

That's one way. One maybe hacky but interesting idea that I like is holding a lock on saving. You basically have the language model hold the lock on saving to disk, and then instead of operating on the ground-truth version of the files that are saved to disk, you're actually operating on what was the shadow workspace before: these unsaved things that only exist in memory, that you still get lint errors for and can code in. Then, when you try to run code, there's a small warning that there's a lock, and you take the lock back from the language server, or from the shadow workspace, if you're trying to do things concurrently.

That's such an exciting feature, by the way. It's a bit of a tangent, but letting a model change files is scary for people, yet it's really cool to be able to just let the agent do a set of tasks and come back the next day and observe, like it's a colleague or something. Yeah, and I think there may be different versions of runnability. For simple things, where you're doing things in the span of a few minutes on behalf of the user as they're programming, it makes sense to make it work locally on their machine. For the more aggressive things, where you're making larger changes that take longer periods of time, you'll probably want to do it in some sandboxed remote environment, and that's another incredibly tricky problem: how do you exactly reproduce, or mostly reproduce to the point of being effectively equivalent for running code, the user's environment in a remote sandbox?

I'm curious what kind of agents you want for coding. Do you want them to find bugs? Do you want them to implement new features? What agents do you want?
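A conceptual TypeScript sketch of the feedback loop being described: edit in memory, read language-server diagnostics, feed them back to the model, repeat. This is not Cursor's actual shadow workspace code; `proposeFix` is a hypothetical stand-in for the model call, and in a real extension the diagnostics refresh asynchronously after each edit, which is glossed over here:

```typescript
import * as vscode from "vscode";

// Hypothetical model call: given file text and the current lint errors,
// returns a revised version of the file.
declare function proposeFix(text: string, errors: string[]): Promise<string>;

// Iterate in memory (never saved to disk, per the "lock on saving" idea),
// using language-server diagnostics as the feedback signal.
async function iterateWithFeedback(uri: vscode.Uri, maxRounds = 5): Promise<void> {
  const doc = await vscode.workspace.openTextDocument(uri);
  for (let round = 0; round < maxRounds; round++) {
    const errors = vscode.languages
      .getDiagnostics(uri)
      .filter(d => d.severity === vscode.DiagnosticSeverity.Error)
      .map(d => d.message);
    if (errors.length === 0) return; // the agent's edits now pass the linter

    const revised = await proposeFix(doc.getText(), errors);
    const fullRange = new vscode.Range(
      doc.positionAt(0),
      doc.positionAt(doc.getText().length)
    );
    const edit = new vscode.WorkspaceEdit();
    edit.replace(uri, fullRange, revised);
    await vscode.workspace.applyEdit(edit); // in-memory change only; not saved
    // A real implementation would wait for the language server to
    // re-publish diagnostics before the next round.
  }
}
```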
So, by the way, when I think about agents, I don't think just about coding. For this particular podcast, there's video editing, and if you look at Adobe, there's a lot of code behind it; it's very poorly documented code, but you can interact with Premiere, for example, using code. Basically all the uploading, everything I do on YouTube, as you could probably imagine, I do through code, including translation and overdubbing, all of this. So I envision all those kinds of tasks: automating many of the tasks that don't have to do directly with the editing. Okay, that's what I was thinking about. But in terms of coding, I would be fundamentally thinking about bug finding, many levels of bug finding, and also finding logical bugs, not logical like spiritual bugs or something, but big directions-of-implementation kinds of things.

On bug finding: yeah, it's really interesting that these models are so bad at finding bugs when just naively prompted to find a bug. They're incredibly poorly calibrated. Even the smartest models? Exactly, even o1. How do you explain that? Is there a good intuition?

I think these models are a really strong reflection of the pre-training distribution, and I do think they generalize as the loss gets lower and lower, but I don't think the loss, and the scale, are quite at the point where they're really fully generalizing in code. The things that the frontier models are quite good at, the things we use them for, are really code generation and question answering, and those exist in massive quantities in pre-training: all the code on GitHub, on the scale of many trillions of tokens, and questions and answers on things like Stack Overflow and maybe GitHub issues. When you try to push toward things that really don't exist very much online, like the Cursor Tab objective of predicting the next edit given the edits done so far, the brittleness shows. Bug detection is another great example: there aren't that many examples of actually detecting real bugs and then proposing fixes, and the models just really struggle at it. But I think it's a question of transferring the model. In the same way that you get this fantastic transfer from models pre-trained on code in general to the Cursor Tab objective, you'll see a very similar thing with generalized models that are really good at code transferring to bug detection; it just takes a little bit of nudging in that direction.

To be clear, I think they do understand code really well. While they're being pre-trained, the representation that's being built up almost certainly, somewhere in the stream, knows that maybe there's something sketchy going on; it has some sense of sketchiness. But actually eliciting that sketchiness is hard, and part of it is that humans are really calibrated on which bugs are really important. It's not just saying that something is sketchy.
It's: is this sketchy and trivial, or is it sketchy like it's going to take the server down? Part of it is maybe the cultural knowledge of why a staff engineer is a staff engineer. A staff engineer is good because they know that three years ago someone wrote a really sketchy piece of code that took the server down, as opposed to: maybe this thing is just an experiment, so a few bugs are fine, you're just trying to get a feel for it. If the model gets really annoying when you're writing an experiment, that's really bad. But if you're writing something for production, you're writing a database, you're writing code in Postgres or Linux or whatever, you're Linus Torvalds, it's unacceptable to have even an edge case. Just having the calibration of how paranoid the user is is hard, and even if you put it at maximum paranoia, it still doesn't quite get it.

Yeah, and this is hard for humans too: understanding which line of code is important and which is not. I think one of your principles on the website says: if a piece of code can do a lot of damage, one should add a comment that says this line of code is dangerous. And all caps, repeated ten times? No, for every single line of code inside the function. That's quite profound; it says something about human beings, because engineers move on, and even the same person might forget how a single function can sink the Titanic. You might not intuit that just by looking at that one piece of code.

Yeah, and I think that practice is partially also for today's AI models: if you actually write "dangerous, dangerous, dangerous" on every line, the models will pay more attention to it and be more likely to find bugs in that region. That's actually just straight up a really good practice: labeling code by how much damage it can do. Yeah, it's controversial; some people think it's ugly. Sualeh? Well, this is one of the things I learned from Arvid: aesthetically I don't like it, but there's certainly something to it. It's useful for the models, and humans just forget a lot; it's really easy to make a small mistake and, you know, bring down the server. Of course we test a lot and whatever, but there are always these things you have to be very careful about. Yeah, with just normal docstrings, people will often skim them when making a change and think "oh, I know how to do this," and you really need to point it out to them so it doesn't slip through. You have to be reminded that you can do a lot of damage; we don't really think about that. You think "how do I figure out how this works so I can improve it"; you don't think about the other direction.
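As a toy illustration of the convention being described (the function, the `BillingDb` interface, and the comments are made up for the example, not from an actual codebase):

```typescript
interface BillingDb {
  chargeLateFee(accountId: string): Promise<void>;
}

// DANGEROUS: this function mutates live billing records.
// DANGEROUS: a wrong account list here charges real customers.
// DANGEROUS: do not change the charging logic without a reviewed plan.
async function applyLateFees(db: BillingDb, accountIds: string[]): Promise<void> {
  // DANGEROUS: the fee is applied once per call; retries must be idempotent.
  for (const id of accountIds) {
    await db.chargeLateFee(id); // DANGEROUS: writes to production immediately.
  }
}
```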
Until we have formal verification for everything; then you can do whatever you want, and you know for certain that you have not introduced a bug, if the proof passes. But concretely, what do you think that future would look like? I think people will just not write tests anymore: you write a function, the model suggests a spec, you review the spec, and in the meantime a smart reasoning model computes a proof that the implementation follows the spec. I think that happens for most functions.

Don't you think this gets at some of the stuff you were talking about earlier, with the difficulty of specifying intent for what you want with software? If the intent is really hard to specify, it's also going to be really hard to prove that the implementation actually matches your intent. You think the spec is hard to generate? Yeah, or just, for a given spec... there's a question of whether you can actually do the formal verification at all; is that possible? I think there's more to dig into there. But then also, even if you have this spec, how do you... is the spec written in natural language? The spec would be formal. But how easy would that be? So then I think you care about things that are not going to be easily well specified in the spec language. I see; maybe that's an argument against "formal verification is all you need." Yeah, the worry is that there's this massive document replacing something like unit tests. Sure. I think you can probably also evolve the spec languages to capture some of the things they don't really capture right now. I don't know; I think it's very exciting.

And you're speaking not just about single functions; you're speaking about entire codebases. I think entire codebases is harder, but that is what I would love to have, and I think it should be possible. There's a lot of work recently where you can formally verify down to the hardware: you formally verify the C code, then through the GCC compiler, then through the Verilog, down to the hardware. That's an incredibly big system, but it actually works, and I think big codebases are similar in that they're multi-layered systems. If you can decompose it and formally verify each part, then I think it should be possible. I think the specification problem is a real problem, though.

But how do you handle side effects, or external dependencies, like calling the Stripe API? Maybe Stripe would write a spec for their API. But can you do this for everything you use? What if there's a language model involved? Maybe people use language models as primitives in the programs they write, and there's a dependence on them; how do you include that? I think you might be able to prove that still. Prove what, about language models? It feels possible that you could actually prove that a language model is aligned, for example, or that it actually gives the right answer. That's the dream.
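A toy illustration of the shape of "spec plus machine-checked proof," assuming a recent Lean 4 toolchain where the `omega` tactic is available; this is obviously nowhere near the scale of verifying a real codebase, but it is the same basic loop being described: write the implementation, state the spec, and have the proof checked.

```lean
-- Implementation
def double (n : Nat) : Nat := n + n

-- Spec: `double n` always equals 2 * n, checked by the proof assistant.
theorem double_meets_spec (n : Nat) : double n = 2 * n := by
  unfold double
  omega
```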
Yeah, if it's possible. Your "I Have a Dream" speech. If it's possible, it will certainly help with making sure your code doesn't have bugs and making sure AI doesn't destroy all of human civilization. So: the full spectrum of AI safety, down to bug finding. You said the models struggle with bug finding; what's the hope?

My hope initially, and I can let Michael chime in on it, is that it should first help with the stupid bugs. It should very quickly catch the stupid bugs, like off-by-one errors, or cases where you write something in a comment and do the opposite in the code; that's very common. I do this: I write "less than" in a comment and maybe write "greater than" in the code, and the model says, "this looks sketchy, are you sure you want to do that?" But eventually it should be able to catch harder bugs too.

Yeah, and it's also important to note that having good bug-finding models feels necessary to get to the highest reaches of having AI do more and more programming for you. If the AI is building more and more of the system, you need to not just generate but also verify, and without that, some of the problems we've talked about before with programming with these models will just become untenable. So it's not just for humans, like "you write a bug, I write a bug, find the bug for me"; it's also being able to verify the AI's code and check it, which is really important.

And how do you actually do this? We've had a lot of contentious dinner discussions about how you actually train a bug model. One very popular idea is that it's potentially easier to introduce a bug than to find one, so you can train a model to introduce bugs into existing code, and then train a reverse bug-finding model that finds bugs using this synthetic data. That's one example, but there are lots of ideas. You can also do a bunch of work not even at the model level: taking the biggest models and giving them access to a lot of information that's not just the code. It's kind of a hard problem to stare at a file and ask "where's the bug?"; that's often hard for humans too. So often you have to run the code, and being able to see things like traces and step through a debugger is a whole other direction that tends toward that.

It could also be that there are two different product form factors here. You might have a really specialized model that's quite fast, running in the background and trying to spot bugs. And sometimes, to Arvid's earlier example about the nefarious input-box bug, you're not checking hypothesis-free; you know there's a problem, you really want to solve it, and you zap it with tons and tons of compute, and you're willing to put in fifty dollars, or even more, to solve that bug.
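A very simple version of the "introduce bugs to train a bug finder" idea, as a TypeScript sketch. The mutation list and the assumption that single operator flips make useful training data are illustrative only, not a description of how anyone actually trains such a model:

```typescript
// Each training example pairs buggy code with its original (correct) code
// and a record of the injected defect, so a model can learn to spot it.
interface BugExample {
  original: string;
  buggy: string;
  description: string;
}

// A few trivially injectable defects: operator flips and off-by-one edits.
const mutations: Array<{ from: RegExp; to: string; description: string }> = [
  { from: /<=/, to: "<", description: "off-by-one: <= weakened to <" },
  { from: /===/, to: "!==", description: "inverted equality check" },
  { from: /\+ 1\b/, to: "- 1", description: "off-by-one: +1 flipped to -1" },
];

function injectBug(source: string): BugExample | null {
  for (const m of mutations) {
    if (m.from.test(source)) {
      return {
        original: source,
        buggy: source.replace(m.from, m.to), // replaces only the first match
        description: m.description,
      };
    }
  }
  return null; // nothing we know how to mutate in this snippet
}

const example = injectBug(
  "for (let i = 0; i <= items.length - 1; i++) { total += items[i]; }"
);
console.log(example?.description, "\n", example?.buggy);
```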
Have you thought about integrating money into this whole thing? I would probably pay a large amount of money if it found a bug, or even generated code that I really appreciated. I had a moment a few days ago when Cursor generated three perfect functions for interacting with the YouTube API to update captions, for localization in different languages. The API documentation is not very good; I Googled for a while and couldn't find exactly what I needed, there's a lot of confusing information, and Cursor generated it perfectly. I just sat back, read the code, thought "this is correct," tested it, and it was correct. I wanted a tip button that says "here's $5, that one was really good," just to support the company and the interface. And the other thing is that it probably sends a strong signal, like "good job," a much stronger signal than just accepting the code. And for bug finding, obviously, there are a lot of people who would pay a huge amount of money for a bug, like a bug bounty thing, right? Do you guys think about that?

Yeah, it's a controversial idea inside the company. I think it depends on how much you believe in humanity, almost. I think it would be really cool if you spend nothing to try to find a bug; if it doesn't find one, you spend zero dollars, and if it does find a bug and you click accept, then it also shows, in parentheses, one dollar, and you spend a dollar to accept the bug. Then of course there's the worry: okay, we spent a lot of computation, and maybe people will just copy-paste. I think that's a worry. There's also the worry that introducing money into the product makes it not feel as fun anymore; you have to think about money, and all you want to think about is the code. So maybe it actually makes more sense to separate it out: you pay some fee every month and you get all of these things for free. But there could be a tipping component, which still has that dollar symbol. I think it's fine, but I also see the point where maybe you don't want to introduce it.

Yeah, I was going to say: the moment where it feels like people would do this is when they share it, when they have this fantastic example and share it with their friends. There's also a potential world where there's a technical solution to this honor-system problem, too. If we can get to a place where we understand the output of the system more, like the stuff we were talking about with error checking with the LSP and also running the code, if you could actually somehow verify "oh, I have fixed the bug," then maybe the bounty system doesn't need to rely on the honor system.

How much interaction is there between the terminal and the code? How much information is gained from running the code in the terminal? Can you do a loop where it runs the code and suggests how to change it if the code gives an error at runtime, or are they right now completely separate worlds?
I know you can do Ctrl+K inside the terminal to help you write the code. Yeah, and you can use terminal context as well, inside of Command-K, kind of everything. We don't have the looping part yet, though we suspect something like this could make a lot of sense. There's a question of whether it happens in the foreground or in the background, like what we've been discussing. Sure, the background is pretty cool, running the code in different ways. Plus there's a database side to this: how do you protect it from modifying the database?

Okay, there are certainly cool solutions there. There's a new API being developed for this; it's not in AWS, but I think it's in PlanetScale, I don't know if PlanetScale was the first one to add it: the ability to add branches to a database. If you're working on a feature and you want to test against the prod database, but you don't actually want to test against the prod database, you could add a branch to the database, and the way to do that is to add a branch to the write-ahead log. There's obviously a lot of technical complexity in doing it correctly. I guess database companies need new things to do, because they have good databases now. Turbopuffer, which is one of the databases we use, is maybe going to add branching to the write-ahead log, and so maybe the AI agents will use branching: they'll test against some branch, and it's sort of going to be a requirement for the database to support branching or something. It would be really interesting if you could branch a file system, right? Yeah, I feel like everything needs branching. It's like the problem with the multiverse: if you branch on everything, that's a lot. There are obviously super clever algorithms to make sure you don't actually use a lot of space or CPU or whatever.

Okay, this is a good place to ask about infrastructure. You guys mostly use AWS; what are some interesting details, what are some interesting challenges? Why did you choose AWS, and why is AWS still winning, hashtag? AWS is just really, really good. Whenever you use an AWS product, you just know that it's going to work. It might be absolute hell to go through the steps to set it up. Why is the interface so horrible? Because it's just so good it doesn't need to... It's the nature of winning. Yeah, but AWS you can always trust: it will always work, and if there is a problem, it's probably your problem.

Okay, are there some interesting challenges for you, a pretty new startup, in scaling to so many people? Yeah, it has been an interesting journey adding each extra zero to the requests per second. You run into issues where the general components you're using for caching and databases break as you make things bigger and bigger, and now we're at the scale where we get int overflows on our tables and things like that.
And then there have also been some custom systems that we've built, like our retrieval system for computing a semantic index of your codebase and answering questions about it; that has continually felt like one of the trickier things to scale. I have a few friends who are super senior engineers, and one of their lines is: it's very hard to predict where systems will break when you scale them. You can try to predict in advance, but there's always something weird that happens when you add this extra zero; you thought you had thought through everything, but you didn't actually think through everything.

For that particular system, the concrete details are: we chunk up all of your code, then we send up the code for embedding, we embed the code, and we store the embeddings in a database, but we don't actually store any of the code. Then, partly to make sure we don't introduce client bugs, because we're very paranoid about client bugs, we keep much of the detail on the server, and everything is encrypted. One of the technical challenges is always making sure that the local index, the local codebase state, is the same as the state on the server. The way we ended up doing that, technically, is: for every single file you can keep a hash, and for every folder you can keep a hash, which is the hash of all of its children, and you can do that recursively up to the top.

Why do something complicated? One thing you could do is keep a hash for every file, and then every minute try to download the hashes on the server, figure out which files don't exist on the server, maybe you just created a new file, maybe you just deleted a file, maybe you checked out a new branch, and try to reconcile the state between the client and the server. But that introduces absolutely ginormous network overhead, both on the client side, and nobody really wants us to hammer their Wi-Fi all the time if they're using Cursor, and also in the database: it would mean reading this tens-of-terabytes database, approaching something like 20 terabytes, every second. That's just crazy; you definitely don't want to do that. So what you do is just try to reconcile the single hash at the root of the project, and if something mismatches, you go find where things disagree: you look at the children and see if the hashes match, and if they don't, you look at their children, and so on. But you only do that in the scenario where things don't match, and for most people, most of the time, the hashes match.

So it's a kind of hierarchical reconciliation? Yeah, something like that. It's called a Merkle tree. Yeah, Merkle. It's cool to see that you have to think through all these problems.
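A minimal TypeScript sketch of that hierarchical hashing and reconciliation. The directory shape and helper names are made up for illustration; a real client would cache intermediate hashes and handle deletions, renames, and incremental updates:

```typescript
import { createHash } from "node:crypto";

type Tree = { [name: string]: Tree | string }; // string leaf = file contents

const sha = (s: string) => createHash("sha256").update(s).digest("hex");

// Hash of a file is the hash of its contents; hash of a folder is the hash
// of its children's names and hashes, computed recursively (a Merkle tree).
function merkleHash(node: Tree | string): string {
  if (typeof node === "string") return sha(node);
  const childHashes = Object.keys(node)
    .sort()
    .map(name => `${name}:${merkleHash(node[name])}`);
  return sha(childHashes.join("\n"));
}

// Reconciliation: compare root hashes and only descend into subtrees whose
// hashes disagree; most of the time nothing mismatches and this stops at
// the root, so almost no data is exchanged.
function findMismatches(local: Tree | string, remote: Tree | string, path = ""): string[] {
  if (merkleHash(local) === merkleHash(remote)) return [];
  if (typeof local === "string" || typeof remote === "string") return [path || "/"];
  const names = new Set([...Object.keys(local), ...Object.keys(remote)]);
  return [...names].flatMap(name =>
    findMismatches(local[name] ?? "", remote[name] ?? "", `${path}/${name}`)
  );
}
```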
The reason it's gotten hard is just the number of people using it, and because some of your customers have really, really large codebases. We reworked our own codebase, which is big, but it's just not the size of some company that's been around for 20 years and has an enormous number of files, and you want to scale that across programmers. There are all these details where building the simple thing is easy, but scaling it to a lot of people, a lot of companies, is obviously a difficult problem, which is somewhat independent of the ideas themselves. So part of it is scaling our current solution, part is coming up with new ideas that we're obviously working on, and then scaling all of that.

And there are a lot of clever additional things that go into this indexing system. For example, the bottleneck in terms of cost is not storing things in the vector database; it's actually embedding the code. You don't want to re-embed the codebase for every single person at a company using the exact same code, except that maybe they're on a different branch with a few different files, or they've made a few local changes. Because embeddings are the bottleneck, you can do one clever trick and not have to worry about the complexity of dealing with branches and the other databases: you just have a cache on the actual vectors, keyed by the hash of a given chunk. This means that when the n-th person at a company goes to index their codebase, it's really, really fast. And you do all of this without actually storing any code on our servers; no code data is stored, we just store the vectors in the vector database and the vector cache.
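A small TypeScript sketch of that content-hash-keyed vector cache. The `embed` function here is a trivial stand-in for a real embedding model call, and the in-memory `Map` stands in for whatever shared cache a real system would use:

```typescript
import { createHash } from "node:crypto";

// Stand-in for the expensive call to an embedding model; this toy version
// just maps characters to numbers so the example runs on its own.
async function embed(chunk: string): Promise<number[]> {
  return Array.from(chunk).map(c => c.charCodeAt(0) / 255);
}

const vectorCache = new Map<string, number[]>(); // key: hash of chunk contents

// Look up each chunk by its content hash; only embed chunks that nobody
// using the same code has embedded before.
async function embedChunks(chunks: string[]): Promise<number[][]> {
  return Promise.all(
    chunks.map(async chunk => {
      const key = createHash("sha256").update(chunk).digest("hex");
      const cached = vectorCache.get(key);
      if (cached) return cached;         // cache hit: no embedding cost
      const vector = await embed(chunk); // cache miss: pay for it once
      vectorCache.set(key, vector);
      return vector;
    })
  );
}
```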
What are the biggest gains users get from indexing the codebase at this time, just out of curiosity? It seems like longer-term there'll be more and more benefit, but in the short term, just asking questions of the codebase, what's the usefulness of that?

I think the most obvious one is: you want to find out where something is happening in your large codebase, and you have a fuzzy memory of "I want to find the place where we do X," but you don't exactly know what to search for in a normal text search. So you hit command-enter to ask with the codebase chat, and very often it finds the right place you were thinking of. Like you mentioned, in the future I think this is only going to get more and more powerful; we're working a lot on improving the quality of our retrieval, and I think the ceiling for that is much higher than people give it credit for.

One question that's good to ask here: have you considered, and why haven't you done much, local stuff? It seems like everything we just discussed is exceptionally difficult to do in the cloud: you have to think about all these things with the caching, and a large codebase with a large number of programmers using the same codebase; you have to figure out that puzzle, whereas a lot of software just does this kind of heavy computational stuff locally. Have you considered doing embeddings locally?

Yeah, we've thought about it, and I think it would be cool to do it locally, but it's just really hard. One thing to keep in mind is that some of our users use the latest MacBook Pro, but more than 80% of our users are on Windows machines, many of which are not very powerful. Local models really only work on the latest computers, and it's also a big overhead to build that in. Even if we would like to do it, it's currently not something we're able to focus on. There are some people who do that, and I think that's great, but especially as models get bigger and you want to do fancier things with bigger models, it becomes even harder to do locally. And it's not only a problem of weaker computers: if you're some big company with a big-company codebase, it's just really hard to process that even on the beefiest MacBook Pro. It's not a matter of whether you're just a student or something; I think if you're the best programmer at a big company, you'll still have a horrible experience if you do everything locally. You could do it and sort of scrape by, but again, it wouldn't be fun anymore. Yeah, approximate nearest neighbors on a massive codebase is going to eat up your memory and your CPU.

And that's just the indexing; let's also talk about the modeling side, where there are massive headwinds against local models. One is that things seem to be moving toward MoEs, mixture-of-experts models. One benefit is that maybe they're more memory-bandwidth-bound, which plays in favor of local versus using Nvidia GPUs, but the downside is that these models are just bigger in total, and they often need to fit not on a single node but on multiple nodes. There's no way that's going to fit inside even a really good MacBook. And especially for coding, it's not so much a question of whether it clears some bar of "the model is good enough to do these things, so we're satisfied," which may be the case for other problems, and maybe where local models shine; people are always going to want the best, the most intelligent, the most capable thing, and that's going to be really hard to run locally for almost everyone.

Don't you want the most capable model? You want Sonnet? And also o1... I like how you're pitching me. Would you be satisfied with an inferior model? Listen, yes, I'm one of those people. But there are some people who like to do stuff locally; there's a whole open-source movement that kind of resists, and it's good that they exist, actually, because you want to resist the power centers that are growing.
There's actually an alternative to local models that I'm particularly fond of. I think it's still very much in the research stage, but you could imagine doing homomorphic encryption for language model inference: you encrypt your input on your local machine, then you send that up, and the server can apply lots of computation, running models you couldn't run locally, on this encrypted data, but it cannot see what the data is. Then it sends back the answer, you decrypt it, and only you can see the answer. That's still very much research, and all of it is about trying to make the overhead lower, because right now the overhead is really big, but if you can make it happen, I think it would be really cool and really impactful.

One thing that's actually kind of worrisome is that as these models get better and better, they're going to become more and more economically useful, so more and more of the world's information and data will flow through one or two centralized actors. There are worries about traditional hacking attempts, but it also creates this scary situation where, if all of the world's information is flowing through one node in plaintext, you can have surveillance in very bad ways. Sometimes that will happen initially for good reasons: people will want to protect against bad actors using AI models in bad ways, then some surveillance code gets added, then someone else comes in, and you're on a slippery slope, and then you start doing bad things with a lot of the world's data. So I'm very hopeful that we can solve homomorphic encryption for privacy-preserving machine learning.

I would say that's the challenge we have with all software these days. There are so many features that can be provided from the cloud, and all of us increasingly rely on it, and it makes our lives awesome, but there are downsides, and that's why you rely on really good security to protect from basic attacks. But there's also only a small set of companies controlling that data; they obviously have leverage, and they could be infiltrated in all kinds of ways. That's the world we live in.

Yeah, the thing I'm actually quite worried about is this: Anthropic has this responsible scaling policy, where we're at the low ASLs, the Anthropic Security Levels, of the models. But as we get to ASL-3, ASL-4, whatever, very powerful models, then, for mostly reasonable security reasons, you would want to monitor all the prompts. I think that's reasonable and understandable, and I see where everyone is coming from, but man, it would be really horrible if all the world's information were monitored that heavily; it's way too centralized. It's this really fine line you're walking: on one side, you don't want the models to go rogue; on the other side, man, humans... I don't know if I trust all the world's information to pass through three model providers.

Yeah. Why do you think it's different from cloud providers?
- I would say that's the challenge we have with all software these days. There are so many features that can be provided from the cloud, and all of us increasingly rely on it, and it makes our lives awesome. But there are downsides, and that's why you rely on really good security to protect from basic attacks. But there's also only a small set of companies that are controlling that data, and they obviously have leverage, and they could be infiltrated in all kinds of ways. That's the world we live in.
- Yeah, I mean, the thing I'm actually quite worried about is the world where, I mean, Anthropic has this responsible scaling policy, and so we're on the low ASLs, which is the Anthropic security level or whatever, of the models. But as we get to ASL-3, ASL-4, whatever models, which are very powerful, then for mostly reasonable security reasons you would want to monitor all the prompts. And I think that's sort of reasonable and understandable where everyone is coming from. But man, it'd be really horrible if all the world's information is monitored that heavily. It's way too centralized. It's sort of this really fine line you're walking, where on the one side you don't want the models to go rogue, and on the other side, man, humans... I don't know if I trust all the world's information to pass through three model providers.
- Yeah, why do you think it's different than cloud providers?
- Because I think a lot of this data would never have gone to the cloud providers in the first place. This is often, like, you want to give more data to the AI models, you want to give personal data that you would never have put online in the first place, to these companies, or to these models. And it also centralizes control, where right now, for cloud, you can often use your own encryption keys, and then it can't really do much. But here it's just centralized actors that see the exact plaintext of everything.
- On the topic of context, that's actually been a friction for me. When I'm writing code, you know, in Python there's a bunch of stuff imported; you could probably intuit the kind of stuff I would like to include in the context. How hard is it to automatically figure out the context?
- It's tricky. I think we can do a lot better at computing the context automatically in the future. One thing that's important to note is that there are trade-offs with including automatic context. So the more context you include for these models, first of all, the slower they are, and the more expensive those requests are, which means you can then do fewer model calls and less fancy stuff in the background. Also, for a lot of these models, they get confused if you have a lot of information in the prompt, so the bar for accuracy and for relevance of the context you include should be quite high. We already do some automatic context in some places within the product, and it's definitely something we want to get a lot better at. I think there are a lot of cool ideas to try there, both on learning better retrieval systems, like better embedding models, better rankers, and I think there are also cool academic ideas, stuff we've tried out internally, but also stuff the field is grappling with writ large: can you get language models to a place where you can actually just have the model itself understand a new corpus of information? The most popular, talked-about version of this is, can you make the context windows infinite? Then, if you make the context windows infinite, can you make the model actually pay attention to the infinite context? And then, after you can make it pay attention to the infinite context, to make it somewhat feasible to actually do it, can you then do caching for that infinite context, so you don't have to recompute it all the time? But there are other cool ideas being tried that are a little bit more analogous to fine-tuning, of actually learning this information in the weights of the model. And it might be that you actually get a qualitatively different type of understanding if you do it more at the weight level than if you do it at the in-context learning level. I think the jury's still a little bit out on how this is all going to work in the end, but in the interim, us as a company, we are really excited about better retrieval systems and picking the parts of the codebase that are most relevant to what you're doing. We could do that a lot better.
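To make the trade-off concrete, here is a minimal sketch of automatic context selection with a deliberately high relevance bar; the `embed` function is a toy stand-in for a real embedding model, and the `top_k` and `min_score` knobs are illustrative assumptions, not Cursor's actual values:

```python
import math

def embed(text: str) -> list[float]:
    # Toy stand-in for an embedding model: a tiny normalized bag-of-characters
    # vector, just to keep the sketch self-contained and runnable.
    vec = [0.0] * 64
    for ch in text.lower():
        vec[ord(ch) % 64] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]

def cosine(a: list[float], b: list[float]) -> float:
    return sum(x * y for x, y in zip(a, b))

def pick_context(query: str, chunks: list[str], top_k: int = 3, min_score: float = 0.2) -> list[str]:
    """Keep only the few chunks that clear a relevance bar.

    More context makes requests slower and more expensive and can confuse
    the model, so the bar for inclusion is deliberately high (top_k and
    min_score are illustrative knobs, not real product settings)."""
    q = embed(query)
    scored = sorted(((cosine(q, embed(c)), c) for c in chunks), reverse=True)
    return [c for score, c in scored[:top_k] if score >= min_score]

chunks = ["def parse_config(path): ...", "class HttpClient: ...", "README: project overview"]
print(pick_context("where do we read the config file?", chunks))
```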
One interesting proof of concept for learning this knowledge directly in the weights is with VS Code. So we're in a VS Code fork, and VS Code's code is all public. So these models, in pre-training, have seen all the code. They've probably also seen questions and answers about it, and then they've been fine-tuned and RLHF'd to be able to answer questions about code in general. So when you ask it a question about VS Code, sometimes it'll hallucinate, but sometimes it actually does a pretty good job at answering the question. And I think this is just, it happens to be okay at it. But what if you could actually specifically train or post-train a model such that it really was built to understand this codebase? It's an open research question, one that we're quite interested in. And then there's also uncertainty of, do you want the model to be the thing that, end to end, is doing everything, i.e. it's doing the retrieval in its internals and then answering your question, creating the code? Or do you want to separate the retrieval from the frontier model, where maybe you'll get some really capable models that are much better than the best open source ones in a handful of months, and then you'll want to separately train a really good open source model to be the retriever, to be the thing that feeds the context in to these larger models?
- Can you speak a little more to post-training a model to understand the codebase? What do you mean by that? Is this the synthetic data direction?
- Yeah, I mean, there are many possible ways you could try doing it. There's certainly no shortage of ideas; it's just a question of going in and trying all of them and being empirical about which one works best. One very naive thing is to try to replicate what's done with VS Code and these frontier models. So let's continue pre-training, some kind of continued pre-training that includes general code data but also throws in a lot of the data of some particular repository that you care about. And then in post-training, meaning, let's just start with instruction fine-tuning: you have a normal instruction fine-tuning dataset about code, then you throw in a lot of questions about code in that repository. You could either get ground-truth ones, which might be difficult, or you could do what you kind of hinted at, or suggested, using synthetic data, i.e. kind of having the model ask questions about various pieces of the code. So you take the pieces of the code, then prompt the model, or have a model propose a question for that piece of code, and then add those as instruction fine-tuning data points. And then, in theory, this might unlock the model's ability to answer questions about that codebase.
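A sketch of that naive recipe, assuming a hypothetical `llm()` helper for the model call; none of this is Cursor's actual pipeline, it just shows the shape of generating synthetic question-answer pairs from a repository for instruction fine-tuning:

```python
from pathlib import Path

def llm(prompt: str) -> str:
    # Placeholder for whatever model authors the synthetic questions/answers.
    # (This function and its behavior are assumptions, not a real API.)
    return "PLACEHOLDER RESPONSE"

def synthetic_qa_for_repo(repo_root: str, max_chunk_chars: int = 2000) -> list[dict]:
    """Naive recipe from the conversation: for each piece of code, have a model
    propose a question about it, answer it with the code in view, and keep the
    pair as an instruction fine-tuning example grounded in this repository."""
    examples = []
    for path in Path(repo_root).rglob("*.py"):
        code = path.read_text(errors="ignore")[:max_chunk_chars]
        question = llm("Propose one question a developer might ask about this code:\n" + code)
        answer = llm(f"Answer the question using the code.\nQuestion: {question}\nCode:\n{code}")
        examples.append({"instruction": question, "input": "", "output": answer})
    return examples

print(len(synthetic_qa_for_repo(".")))  # number of synthetic examples produced
```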
- Let me ask you about OpenAI o1. What do you think is the role of that kind of test-time compute system in programming?
- I think test-time compute is really, really interesting. So there's been the pre-training regime, which will, as you scale up the amount of data and the size of your model, get you better and better performance, both on loss and then on downstream benchmarks, and just general performance when we use it for coding or other tasks. We're starting to hit a bit of a data wall, meaning it's going to be hard to continue scaling up this regime. So scaling up test-time compute is an interesting way of increasing the number of inference-time flops that we use while still getting corresponding improvements in the performance of these models as you increase the number of flops used at inference time. Traditionally, we just had to literally train a bigger model, one that always used that many more flops, but now we could perhaps use the same-sized model and run it for longer to be able to get an answer at the quality of a much larger model. And so the really interesting thing I like about this is that there are some problems that perhaps require 100-trillion-parameter-model intelligence, trained on 100 trillion tokens, but that's maybe 1%, maybe 0.1%, of all queries. So are you going to spend all of this effort, all this compute, training a model that costs that much and then run it so infrequently? It feels completely wasteful when, instead, you train the model that's capable of doing the 99.9% of queries, and then you have a way of running it for longer at inference time for those few people that really, really want max intelligence.
- How do you figure out which problem requires what level of intelligence? Is it possible to dynamically figure out when to use GPT-4, when to use a small model, and when you need o1?
- I mean, yeah, that's an open research problem, certainly. I don't think anyone's actually cracked this model routing problem quite well. We'd like to. We have kind of initial implementations of this for something like Cursor Tab, but at the level of going between 4o, Sonnet, and o1, it's a bit trickier. There's also the question of what level of intelligence you need to determine if the thing is too hard for the GPT-4-level model; maybe you need the o1-level model. It's really unclear.
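A toy sketch of what a model router could look like; the difficulty heuristic, thresholds, and model-tier names are all illustrative assumptions, since, as noted above, nobody has really cracked this problem:

```python
def estimate_difficulty(query: str) -> float:
    # Toy heuristic stand-in for the hard part: deciding whether a request
    # needs frontier-level reasoning. A real router would itself be learned.
    signals = ["prove", "refactor the whole", "race condition", "deadlock", "optimize"]
    return min(1.0, 0.2 * sum(s in query.lower() for s in signals) + len(query) / 2000)

def route(query: str) -> str:
    """Pick a model tier for a request (names and cutoffs are illustrative)."""
    difficulty = estimate_difficulty(query)
    if difficulty < 0.2:
        return "small-fast-model"      # e.g. a cheap autocomplete-class model
    if difficulty < 0.6:
        return "mid-tier-model"        # e.g. a strong general chat model
    return "slow-reasoning-model"      # e.g. an o1-style test-time-compute model

print(route("rename this variable"))
print(route("prove this lock-free queue has no race condition and optimize it"))
```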
- But you mentioned, so there's a pre-training process, then there's post-training, and then there's test-time compute. Is that fair, sort of separate? Where are the biggest gains?
- Well, it's weird, because for test-time compute there's a whole training strategy needed to get test-time compute to work. And the other really weird thing about this is that no one outside of the big labs, and maybe even just OpenAI, no one really knows how it works. There have been some really interesting papers that show hints of what they might be doing, so perhaps they're doing something with search using process reward models. But yeah, I just think the issue is we don't quite know exactly what it looks like, so it would be hard to comment on where it fits in. I would put it in post-training, but maybe the compute spent on getting test-time compute to work for a model is going to dwarf pre-training eventually.
- So we don't even know if o1 is using just chain of thought, RL... we don't know how they're using any of these. We don't know anything.
- It's fun to speculate.
- If you were to build a competing model, what would you do?
- Yeah, so one thing to do would be, I think you probably need to train a process reward model. So maybe we can get into reward models, and outcome reward models versus process reward models. Outcome reward models are the kind of traditional reward models that people train for language modeling, and it's just looking at the final thing. So if you're doing some math problem, let's look at that final thing, you've done everything, and let's assign a grade to it: how likely is it, what's the reward for this outcome. Process reward models instead try to grade the chain of thought. So OpenAI had some preliminary paper on this, I think last summer, where they used human labelers to get this pretty large, several-hundred-thousand-example dataset of grading chains of thought. Ultimately, it feels like I haven't seen anything interesting in the ways that people use process reward models outside of just using them as a means of affecting how we choose between a bunch of samples. So what people do in all these papers is sample a bunch of outputs from the language model, then use the process reward models to grade all those generations, alongside maybe some other heuristics, and then use that to choose the best answer. The really interesting thing that people think might work, and want to work, is tree search with these process reward models, because if you really can grade every single step of the chain of thought, then you can kind of branch out and explore multiple paths of the chain of thought, and then use these process reward models to evaluate how good each branch you're taking is.
- Yeah, when the quality of the branch is somehow strongly correlated with the quality of the outcome at the very end, so you have a good model of knowing which branch to take, not just in the short term but in the long term.
- Yeah. And the interesting work that I think has been done, the interesting work that has been open-sourced and that people talk about, is how to train the process reward models, maybe in a more automated way. I could be wrong here, and I could not be mentioning some papers, but I haven't seen anything that seems to work really well for using the process reward models creatively to do tree search and code.
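A minimal sketch of the "grade the samples, pick the best" usage described above (not the tree search people hope for); `sample_chain_of_thought` and `prm_score_step` are random placeholders standing in for a real generator and a real process reward model:

```python
import random

def sample_chain_of_thought(problem: str) -> list[str]:
    # Placeholder generator: returns a list of reasoning steps.
    n_steps = random.randint(2, 4)
    return [f"step {i + 1} toward solving: {problem}" for i in range(n_steps)]

def prm_score_step(step: str) -> float:
    # Placeholder process reward model: grades one reasoning step in [0, 1].
    return random.random()

def best_of_n(problem: str, n: int = 8) -> list[str]:
    """Sample n chains of thought, grade every step with the PRM, and keep
    the chain whose worst step scores highest (one simple aggregation choice)."""
    candidates = [sample_chain_of_thought(problem) for _ in range(n)]
    return max(candidates, key=lambda chain: min(prm_score_step(s) for s in chain))

print(best_of_n("integrate x*exp(x) dx"))
```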
- This is kind of an AI safety, maybe a bit of a philosophy, question. So OpenAI says that they're hiding the chain of thought from the user, and they've said that was a difficult decision to make. Instead of showing the chain of thought, they're asking the model to summarize the chain of thought. They're also, in the background, saying they're going to monitor the chain of thought to make sure the model is not trying to manipulate the user, which is a fascinating possibility. But anyway, what do you think about hiding the chain of thought?
- One consideration for OpenAI, and this is completely speculative, could be that they want to make it hard for people to distill these capabilities out of their model. It might actually be easier, if you had access to that hidden chain of thought, to replicate the technology, because that's pretty important data: seeing the steps that the model took to get to the final result, so you could probably train on that also.
- And there was sort of a mirror situation with this with some of the large language model providers, and also this is speculation, but some of these APIs used to offer easy access to log probabilities for the tokens they're generating, and also log probabilities over the prompt tokens, and then some of these APIs took those away. And again, complete speculation, but one of the thoughts is that the reason those were taken away is that, if you have access to log probabilities, similar to this hidden chain of thought, that can give you even more information to try and distill these capabilities out of the APIs, out of these biggest models, into models you control.
- As an asterisk on the previous discussion about us integrating o1: I think that we're still learning how to use this model. So we made o1 available in Cursor because, when we got the model, we were really interested in trying it out. I think a lot of programmers are going to be interested in trying it out. But o1 is not part of the default Cursor experience in any way, and we still haven't found a way to integrate it into the editor in a way that we reach for every hour, maybe even every day. So I think the jury's still out on how to use the model, and we haven't seen examples yet of people releasing things where it seems really clear, like, oh, that's now the use case. The obvious one to turn to is, maybe this can make it easier for you to have these background things running, to have these models in loops, to have these models be agentic. But we're still discovering. To be clear, we have ideas; we just need to try and get something incredibly useful before we put it out there.
- But it has these significant limitations. Even barring capabilities, it does not stream, and that means it's really, really painful to use for things where you want to supervise the output; instead, you're just waiting for the wall of text to show up. Also, it does feel like the early innings of test-time compute and search, where it's very much a v0, and there are so many things that don't feel quite right. And I suspect, in parallel to people increasing the amount of pre-training data and the size of the models in pre-training and finding tricks there, you'll now have this other thread of getting search to work better and better.
- So let me ask you about strawberry tomorrow eyes. So it looks like GitHub Copilot might be integrating o1 in some kind of way, and I think some of the comments are saying, does this mean Cursor is done? I think I saw one comment saying that.
- I saw, "Time to shut down Cursor."
- Time to shut down Cursor, thank you. So is it time to shut down Cursor?
- I think this space is a little bit different from past software spaces over the 2010s, where I think that the ceiling here is really, really, really incredibly high. And so I think that the best product in three to four years will just be so much more useful than the best product today. And you can wax poetic about moats this and brand that and, you know, this is our advantage, but I think in the end, if you just stop innovating on the product, you will lose. And that's also great for startups, that's great for people trying to enter this market, because it means you have
an opportunity to win against people who have, you know, lots of users already, by just building something better. And so I think, yeah, over the next few years it's just about building the best product, building the best system, and that both comes down to the modeling engine side of things, and it also comes down to the editing experience.
- Yeah, I think most of the additional value from Cursor versus everything else out there is not just integrating the new model fast, like o1. It comes from all of the depth that goes into these custom models that you don't realize are working for you in kind of every facet of the product, as well as the really thoughtful UX with every single feature.
- All right, from that profound answer, let's descend back down to the technical. You mentioned you have a taxonomy of synthetic data.
- Oh, yeah.
- Can you please explain?
- Yeah, I think there are three main kinds of synthetic data. So, what is synthetic data, first? There's normal data, like non-synthetic data, which is just data that's naturally created, i.e. usually it'll be from humans having done things; so from some human process you get this data. Synthetic data: the first kind would be distillation, so having a language model output tokens, or probability distributions over tokens, and then you can train some less capable model on this. This approach is not going to get you a model more capable than the original one that produced the tokens, but it's really useful if there's some capability you want to elicit from some really expensive, high-latency model; you can then distill that down into some smaller, task-specific model. The second kind is when one direction of the problem is easier than the reverse. A great example of this is bug detection, like we mentioned earlier, where it's a lot easier to introduce reasonable-looking bugs than it is to actually detect them, and this is probably the case for humans too. So what you can do is get a model that's not trained on that much data, that's not that smart, to introduce a bunch of bugs into code, and then you can use that synthetic data to train a model that can be really good at detecting bugs.
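A toy version of that second kind of synthetic data; a real pipeline would have a model inject plausible-looking bugs, whereas this sketch just flips one comparison operator to manufacture (buggy, clean) training pairs:

```python
import random

def inject_bug(code: str) -> tuple[str, str]:
    """Cheaply corrupt correct code to manufacture (buggy, clean) training pairs.
    Real pipelines would have a weaker model introduce plausible-looking bugs;
    this toy version just flips a comparison operator."""
    swaps = [("<=", "<"), (">=", ">"), ("==", "!=")]
    random.shuffle(swaps)
    for old, new in swaps:
        if old in code:
            return code.replace(old, new, 1), code
    return code, code  # nothing to corrupt

clean = "if len(buffer) >= capacity:\n    flush(buffer)"
buggy, original = inject_bug(clean)
# The (buggy, original) pair becomes a supervised example for a bug-detection
# model: input = buggy code, label = where the corruption was introduced.
print(buggy)
```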
The last category, I think, is, I guess, the main one that it feels like the big labs are doing for synthetic data, which is producing text with language models that can then be verified easily. An extreme example of this is, if you have a verification system that can detect if language is Shakespeare-level, and then you have a bunch of monkeys typing on typewriters, you can eventually get enough training data to train a Shakespeare-level language model. And this is very much the case for math, where verification is actually really, really easy for formal languages. Then what you can do is have an okay model generate a ton of rollouts, choose the ones that you know have actually proved the ground-truth theorems, and train on those further. There are similar things you can do for code with LeetCode-like problems, where, if you have some set of tests that you know correspond to something having actually solved the problem when it passes them, you do the same thing: verify that it's passed the tests, and then train the model on the outputs that have passed the tests. I think it's going to be a little tricky getting this to work in all domains, or just in general. Having the perfect verifier feels really, really hard to do with just open-ended, miscellaneous tasks you give the model, or with more long-horizon tasks, even in coding.
- That's because you're not as optimistic as Arvid. But yeah. So that third category requires having a verifier.
- Yeah. Verification, it feels, is best when you know for a fact that it's correct. And then it wouldn't be like using a language model to verify; it would be using tests, or formal systems.
- Or running the thing, too. Doing the human form of verification, where you just do manual quality control.
- Yeah.
- But the language model version of that, where it's running the thing and it actually understands the output?
- Yeah, no, that's sort of somewhere in between.
- Yeah. I think that's the category that is most likely to result in massive gains.
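A minimal sketch of that verified-generation recipe applied to code: sample candidates, keep only the ones the tests verify, and treat those as training data; `generate_solution` is a canned placeholder for sampling from a model:

```python
def generate_solution(prompt: str, seed: int) -> str:
    # Placeholder for sampling a candidate program from a model.
    # Two canned candidates so the sketch actually runs: one wrong, one right.
    wrong = "def add(a, b):\n    return a - b"
    right = "def add(a, b):\n    return a + b"
    return right if seed % 2 else wrong

def passes_tests(candidate: str) -> bool:
    """The verifier: run the known-good unit tests against the candidate."""
    namespace: dict = {}
    try:
        exec(candidate, namespace)          # define the candidate function
        assert namespace["add"](2, 3) == 5  # the test suite we trust
        assert namespace["add"](-1, 1) == 0
        return True
    except Exception:
        return False

prompt = "Write add(a, b) that returns the sum."
verified = [s for s in (generate_solution(prompt, i) for i in range(8)) if passes_tests(s)]
# `verified` now only contains outputs that provably solve the task;
# these are the samples you would keep as training data.
print(len(verified), "of 8 samples passed the tests")
```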
- What about RL with feedback on the side, RLHF versus RLAIF? What's the role of that in getting better performance out of the models?
- Yeah, so RLHF is when the reward model you use is trained from labels you've collected from humans giving feedback. I think this works if you have the ability to get a ton of human feedback for the kind of task you care about. RLAIF is interesting, because it's depending on the constraint that verification is actually a decent bit easier than generation. Because it feels like, okay, what are you doing? You're using this language model to look at the language model's outputs and then improve the language model. But no, it actually may work, if the language model has a much easier time verifying some solution than it does generating it, that you could perhaps get this kind of recursive loop. But I don't think it's going to look exactly like that. The other thing you could do, that we kind of do, is a little bit of a mix of RLAIF and RLHF, where usually the model is actually quite correct, and this is in the case of Cursor Tab, at picking between two possible generations which is the better one, and then it just needs a little bit of human nudging, with only on the order of 50 to 100 examples, to align that prior the model has with exactly what you want. That looks different from, I think, normal RLHF, where you're usually training these reward models on tons of examples.
- What's your intuition when you compare generation and verification, or generation and ranking? Is ranking way easier than generation?
- My intuition would just say, yeah, it should be. This is kind of going back to, if you believe P does not equal NP, then there's this massive class of problems that are much, much easier to verify given a proof than to actually prove.
- I wonder if the same thing will prove P not equal to NP, or P equal to NP.
- That would be really cool.
- That'd be a, whatever, Fields Medal.
- By AI? Who gets the credit? Another open philosophical question.
- I'm actually surprisingly curious what a good bet for when an AI will get the Fields Medal will be.
- Isn't this Aman's specialty?
- I don't know what Aman's bet here is.
- Oh, sorry, Nobel Prize or Fields Medal first?
- Fields Medal level.
- Fields Medal, I think, comes first. Well, you would say that, of course.
- But it's also this, like, isolated system you can verify.
- Sure.
- I don't even know if I need to...
- I felt like the path to get to IMO was a little bit more clear, because it already could get a few IMO problems, and there was a bunch of low-hanging fruit, given the literature at the time, of what tactics people could take. I think I'm much less versed in the space of theorem proving now, and I have less intuition about how close we are to solving these really, really hard open problems.
- So you think it'll be Fields Medal first? It won't be in physics, or in...?
- Oh, 100%. I think that's probably more likely. Like, it's probably much more likely that it'll get that. Yeah, yeah, yeah. Well, I think it goes to, I don't know, like BSD, which is the Birch and Swinnerton-Dyer conjecture, or the Riemann hypothesis, or any one of these really hard math problems, which are just actually really hard. It's sort of unclear what the path to getting even a solution looks like. We don't even know what a path looks like, let alone...
- And you don't buy the idea that this is an isolated system, you can actually have a good reward system, and it feels like it's easier to train for that?
- I think we might get Fields Medal before AGI.
- I mean, I'd be very happy. I'd be very happy. But I don't know if I think it happens by 2028, 2030...
- For the Fields Medal?
- Fields Medal.
- All right. It feels like forever from now, given how fast things have been going. Speaking of how fast things have been going, let's talk about scaling laws. So, for people who don't know, maybe it's good to talk about this whole idea of scaling laws. What are they, where do things stand, and where do you think things are going?
- I think it was interesting: the original scaling laws paper by OpenAI was slightly wrong, because of, I think, some issues they had with learning rate schedules, and then Chinchilla showed a more correct version. And then from then, people have again kind of deviated from doing the compute-optimal thing, because people now start optimizing more for making the thing work really well given an inference budget. And I think there are a lot more dimensions to these curves than what we originally used, of just compute, number of parameters, and data. Inference compute is the obvious one, and I think context length is another obvious one. So let's say you care about those two things, inference compute and context window; maybe the thing you want to train is some kind of SSM, because they're much, much cheaper and faster at super, super long context. And even if it has maybe 10x worse scaling properties during training, meaning you have to spend 10x more compute to train the thing to get the same level of capability, it's worth it, because you care most about that inference
budget for really long context windows. So it'll be interesting to see how people kind of play with all these dimensions.
- So, yeah, I mean, you speak to the multiple dimensions. Obviously, the original conception was just looking at the variables of the size of the model, as measured by parameters, and the size of the data, as measured by the number of tokens, and looking at the ratio of the two.
- Yeah.
- And it's kind of a compelling notion that there is a number, or at least a minimum, and it seems like one was emerging. Do you still believe that there is a kind of bigger is better?
- I mean, I think bigger is certainly better for just raw performance.
- And raw intelligence.
- And raw intelligence. The path that people might take, though: I'm particularly bullish on distillation. And, yeah, how many knobs can you turn, if we spend a ton of money on training, to get the most capable cheap model? Really, really caring as much as you can about inference-time compute. The naive version of caring as much as you can about inference-time compute is what people have already done with the Llama models, just overtraining 7B models on way, way, way more tokens than is Chinchilla-optimal. But if you really care about it, maybe the thing to do is what Gemma did, which is: let's not just train on tokens, let's literally train on minimizing the KL divergence with the distribution of Gemma 27B, right? So, knowledge distillation there. And you're spending the compute of literally training this 27-billion-parameter model on all these tokens just to get out this, I don't know, smaller model.
- And the distillation gives you just a faster model? Smaller means faster?
- Yeah, distillation, in theory, is, I think, getting out more signal from the data that you're training on. It's perhaps another way of getting over, not completely over, but partially helping with, the data wall, where you only have so much data to train on. Let's train this really, really big model on all these tokens, and we'll distill it into a smaller one, and maybe we can get more signal per token for this much smaller model than we would have originally if we trained it.
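A minimal sketch of the distillation objective being described: instead of training the small model against one-hot next-token labels, minimize the KL divergence to the teacher's full next-token distribution; the vocabulary and probabilities are toy numbers:

```python
import math

def kl_divergence(teacher_probs: list[float], student_probs: list[float]) -> float:
    """KL(teacher || student) over one next-token distribution.
    Distillation means minimizing this (summed over positions) instead of
    cross-entropy against a single hard token label, so every position
    carries the teacher's full distribution as signal."""
    return sum(p * (math.log(p) - math.log(q))
               for p, q in zip(teacher_probs, student_probs) if p > 0)

# Toy 4-token vocabulary: the teacher is fairly sure the next token is index 1.
teacher = [0.05, 0.80, 0.10, 0.05]
student_a = [0.10, 0.60, 0.20, 0.10]   # closer to the teacher
student_b = [0.25, 0.25, 0.25, 0.25]   # uniform, far from the teacher

print(kl_divergence(teacher, student_a))  # smaller loss
print(kl_divergence(teacher, student_b))  # larger loss
```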
- So if I gave you $10 trillion, how would you spend it? I mean, you can't buy an island or whatever. How would you allocate it in terms of improving the big model versus maybe paying for HF in the RLHF?
- Yeah, I think there's a lot of secrets and details about training these large models that I just don't know, and that are only privy to the large labs. And the issue is, I would waste a lot of that money if I even attempted this, because I wouldn't know those things.
- Suspending a lot of disbelief and assuming you had the know-how, and the ability to operate... or, if you're saying, you have to operate with the limited information you have now...
- No, no, no, actually, I would say you swoop in and you get all the information, all the little characteristics, all the little parameters, all the parameters that define how the thing is trained. If we look at how to invest money for the next five years in terms of maximizing what you called raw intelligence...
- I mean, isn't the answer really simple? You just try to get as much compute as possible. Like, at the end of the day, all you need to buy is the GPUs, and then the researchers can find all the... they can sort of, you can tune whether you want a big model or a small model.
- Well, this gets into the question of, are you really limited by compute and money, or are you limited by these other things?
- I'm more prone to Arvid's belief that we're sort of idea-limited, but there's always that...
- But if you have a lot of compute, you can run a lot of experiments.
- So you would run a lot of experiments, versus using that compute to train a gigantic model?
- I would, but I do believe that we are limited in terms of the ideas that we have.
- I think, yeah, because even with all this compute, and, you know, all the data you could collect in the world, you really are ultimately limited by not even ideas but just really good engineering. Even with all the capital in the world, would you really be able to assemble... there aren't that many people in the world who really can make the difference here, and there's so much work that goes into research that is just pure, really, really hard engineering work. As a very hand-wavy example, if you look at the original Transformer paper, how much of the work was joining together a lot of these really interesting concepts embedded in the literature, versus then going in and writing all the code, maybe the CUDA kernels, maybe whatever else, I don't know if it ran on GPUs or TPUs originally, such that it actually saturated the GPU performance? Getting Noam Shazeer to go in and do all of this code, and Noam is probably one of the best engineers in the world. Or maybe going a step further, the next generation of models, getting these things like model parallelism to work, and scaling it on, you know, thousands of, or maybe tens of thousands of, V100s, which I think GPT-3 may have been. There's just so much engineering effort that has to go into all of these things to make it work. If you really brought that cost down, maybe not to zero, but just made it 10x easier, made it super easy for someone with really fantastic ideas to immediately get to the version of the new architecture they dreamed up, one that is getting 50%, 40% utilization on the GPUs, I think that would just speed up research by a ton.
- I mean, I think if you see a clear path to improvement, you should always sort of take the low-hanging fruit first, right? And I think probably OpenAI and all the other labs did the right thing to pick off the low-hanging fruit, where the low-hanging fruit is, sort of, you could scale up to GPT-4.25 scale, and you just keep scaling, and things keep getting better. And there's no point in experimenting with new ideas when everything is working; you should sort of bang on and try to get as much juice out of it as possible. And then maybe, when you really need new ideas... I think if you're spending $10 trillion, you probably want to spend some of it, you know, then actually reevaluate your
ideas, like, probably you're idea-limited at that point.
- I think all of us believe new ideas are probably needed to get, you know, all the way there to AGI. And all of us also probably believe there exist ways of testing out those ideas at smaller scales and being fairly confident that they'll play out. It's just quite difficult for the labs, in their current position, to dedicate their very limited research and engineering talent to exploring all these other ideas, when there's this core thing that will probably improve performance for some decent amount of time.
- Yeah, but also, these big labs are winning, so they're just going wild. Okay. So, big question, looking out into the future: you're now at the center of the programming world. How do you think programming, the nature of programming, changes in the next few months, in the next year, in the next two years, the next five years, ten years?
- I think we're really excited about a future where the programmer is in the driver's seat for a long time. You've heard us talk about this a little bit, but one that emphasizes speed and agency for the programmer, and control: the ability to modify anything you want to modify, the ability to iterate really fast on what you're building. And this is a little different, I think, from where some people are jumping to in the space, where I think one idea that's captivated people is, can you talk to your computer? Can you have it build software for you, as if you're talking to an engineering department or an engineer over Slack? Can it just be this sort of isolated text box? And part of the reason we're not excited about that is some of the stuff we've talked about with latency, but a big reason we're not excited about that is because it comes with giving up a lot of control. It's much harder to be really specific when you're talking in a text box, and if you're necessarily just going to communicate with the thing like you would communicate with an engineering department, you're actually abdicating tons of really important decisions to this bot. And this kind of gets at, fundamentally, what engineering is. I think that some people who are a little bit more removed from engineering might think of it as, the spec is completely written out, and then the engineers just come and they just implement; it's just about making the thing happen in code and making the thing exist. But I think a lot of the best engineering, the engineering we enjoy, involves tons of tiny micro-decisions about what exactly you're building, and about really hard trade-offs between speed and cost and all the other things involved in a system. And as long as humans are actually the ones designing the software, the ones specifying what they want to be built, and it's not just a company run by all AIs, we think you'll really want the human in the driver's seat, dictating these decisions. And so the jury's still out on kind of what that looks like. I think that one weird idea for what that could look like is that you can control the level of abstraction you view a codebase at, and you can point at specific
parts of a codebase, like, maybe you digest a codebase by looking at it in the form of pseudocode, and you can actually edit that pseudocode too, and then have changes get made down at the sort of formal programming level. You keep the, you know, you can gesture at any piece of logic in your software, you keep the in-flow text editing component of programming, you keep the control, you can even go down into the code, you can go at higher levels of abstraction, while it's also giving you these big productivity gains.
- It would be nice if you could go up and down the abstraction stack.
- Yeah. And there are a lot of details to figure out there; that's sort of a fuzzy idea. Time will tell if it actually works. But these principles of control and speed, and the human in the driver's seat, we think are really important. We think for some things, like Arvid mentioned before, for some styles of programming, you can kind of hand it off, chatbot-style, you know, if you have a bug that's really well specified. But that's not most of programming, and that's also not most of the programming we think a lot of people value.
- What about the fundamental skill of programming? There are a lot of people, like young people right now, who are kind of scared, thinking, because they love programming, but they're scared about, will I be able to have a future if I pursue this career path? Do you think the very skill of programming will change fundamentally?
- I actually think this is a really, really exciting time to be building software. Like, we remember what programming was like in, you know, 2013, 2012, whatever it was, and there was just so much more cruft and boilerplate and, you know, looking up something really gnarly. And that stuff still exists, it's definitely not at zero, but programming today is way more fun than back then. It's like we're really getting down to the delight concentration, and all the things that really draw people to programming, like, for instance, this element of being able to build things really fast, and speed, and also individual control, all those are just being turned up a ton. And so I think it's just going to be a really, really fun time for people who build software. I think that the skills will probably change too. I think that people's taste and creative ideas will be magnified, and it will be maybe a little bit less about boilerplate text editing, maybe even a little bit less about carefulness, which I think is really important today if you're a programmer. I think it'll be a lot more fun. What do you guys think?
- I agree. I'm very excited to be able to change... just, one thing that happened recently was, we wanted to do a relatively big migration of our codebase. We were using AsyncLocalStorage in
Node.js, which is known to be not very performant, and we wanted to migrate to our context object. And this is a big migration that affects the entire codebase. Sualeh and I spent, I don't know, five days working through this, even with today's AI tools. And I am really excited for a future where I can just show a couple of examples, and then the AI applies that to all of the locations, and then it highlights, oh, this is a new example, what should I do? And then I show exactly what to do there, and then that can be done in, like, 10 minutes. And then you can iterate much, much faster. Then you don't have to think as much up front and stand at the blackboard and think exactly how we are going to do this, because the cost is so high. You can just try something first, and you realize, oh, this is not actually exactly what I want, and then you can change it instantly again after. And so, yeah, I think being a programmer in the future is going to be a lot of fun.
- Yeah, I really like that point. It feels like a lot of the time with programming, there are two ways you can go about it. One is, you think really hard, carefully, up front, about the best possible way to do it, and then you spend your limited time of engineering actually implementing it. But I much prefer just getting in the code and taking a crack at it, seeing how it kind of lays out, and then iterating really quickly on that. That feels more fun.
- Yeah, just to speak to it: generating the boilerplate is great, so you can just focus on the difficult, nuanced design decisions. Migration, I feel like this is a cool one. It seems like large language models are able to basically translate from one programming language to another, or, like, migrate in the general sense of what migrate means. But that's in the current moment. So, I mean, the fear has to do with, okay, as these models get better and better, then you're making fewer and fewer creative decisions, and is it going to kind of move to a place where you're operating in the design space of natural language, where natural language is the main programming language? And I guess I could ask that by way of advice: if somebody's interested in programming now, what do you think they should learn? Like, you guys started in, say, some Java and, I forget, oh, some PHP.
- PHP.
- Objective-C.
- Objective-C, there you go. I mean, in the end, we all know JavaScript is going to win, and not TypeScript. It's just going to be vanilla JavaScript. It's just going to eat the world, and maybe a little bit of PHP. And, I mean, it also brings up the question of, like, I think Don Knuth has this idea that some percent of the population is geeks, and there's a particular kind of psychology and mind required for programming, and it feels like more and more that expands: the kind of person that is able to do great programming might expand.
- I think different people do programming for different reasons, but I think the true, maybe, like, the best programmers are the ones that really love, just, like, absolutely love programming. For example, there are folks on our team who literally, when they get back from work, they go and boot up Cursor, and then they start coding
on their side projects for the entire night, and they stay up till 3:00 a.m. doing that. And when they're sad, they say, "I just really need to code." And I think there's that level of programmer where this obsession and love of programming, I think, makes really the best programmers. And I think these types of people will really get into the details of how things work.
- I guess the question I'm asking, that exact programmer, let's think about that person. When the super tab, the super awesome, praise-be-to-the-tab, succeeds, you keep pressing tab. That person in the team loves the Cursor Tab more than anybody else, right?
- Yeah, and it's also not just pressing tab. Pressing tab is just the easy way to say it, the catchphrase, you know. But what you're actually doing when you're pressing tab is that you're injecting intent all the time while you're doing it. Sometimes you're rejecting it, sometimes you're typing a few more characters, and that's the way that you're sort of shaping the things that are being created. And I think programming will change a lot to just: what is it that you want to make?
- It's sort of higher bandwidth. The communication to the computer just becomes higher and higher bandwidth, as opposed to just typing, which is much lower bandwidth than communicating intent.
- I mean, this goes to your manifesto, titled "Engineering Genius": "We are an applied research lab building extraordinarily productive human-AI systems." So, speaking to this hybrid element: "To start, we're building the engineer of the future: a human-AI programmer that's an order of magnitude more effective than any one engineer. This hybrid engineer will have effortless control over their codebase and no low-entropy keystrokes. They will iterate at the speed of their judgment, even in the most complex systems. Using a combination of AI and human ingenuity, they will outsmart and out-engineer the best pure AI systems. We are a group of researchers and engineers. We build software and models to invent at the edge of what's useful and what's possible. Our work has already improved the lives of hundreds of thousands of programmers." And on the way to that, we'll at least make programming more fun. So, thank you for talking today.
- Thank you.
- Thanks for having us.
- Thank you.
- Thank you.
Thanks for listening to this conversation with Michael, Sualeh, Arvid, and Aman. To support this podcast, please check out our sponsors in the description. And now, let me leave you with a random, funny, and perhaps profound programming code I saw on Reddit: "Nothing is as permanent as a temporary solution that works." Thank you for listening, and hope to see you next time.

- The following is a conversation<br>with the founding members<br>of the Cursor Team,<br>Michael Truell, Sualeh<br>Asif, Arvid Lunnemark,<br>and Aman Sanger.<br>Cursor is a code editor based on VS Code<br>that adds a lot of powerful<br>features for AI-assisted coding.<br>It has captivated the<br>attention and excitement<br>of the programming and AI communities.<br>So I thought, this is<br>an excellent opportunity<br>to dive deep into the<br>role of AI in programming.<br>This is a super technical conversation<br>that is bigger than just<br>about one code
editor.<br>It's about the future of programming,<br>and in general, the future<br>of human AI collaboration<br>in designing and engineering complicated<br>and powerful systems.<br>This is the "Lex Fridman Podcast."<br>To support it,<br>please check out our<br>sponsors in the description.<br>And now, dear friends,<br>here's Michael, Sualeh, Arvid and Aman.<br>All right, this is awesome.<br>We have Michael, Aman, Sualeh, Arvid here<br>from the Cursor Team.<br>First up, big ridiculous question.<br>What's the point of a code editor?<br>- So the code editor is largely the place<br>where you build software.<br>And today or for a long<br>time, that's meant the place<br>where you text edit a<br>formal programming language.<br>And for people who aren't programmers,<br>the way to think of a code editor<br>is a really souped up word<br>processor for programmers,<br>where the reason it's souped up<br>is code has a lot of structure.<br>And so the, quote,<br>unquote, "word processor,"<br>the code editor can<br>actually do a lot for you<br>that word processors in the writing space<br>haven't been able to do for<br>people editing texts there.<br>And so that's everything<br>from giving you visual differentiation<br>of the actual tokens in the<br>code so you can scan it quickly,<br>to letting you navigate<br>around the code base,<br>like you're navigating around<br>the internet with hyperlinks,<br>you're going to definitions<br>of things you're using<br>to error checking to<br>catch rudimentary bugs.<br>And so traditionally, that's<br>what a code editor has meant.<br>And I think that what a code editor is,<br>is going to change a lot<br>over the next 10 years<br>as what it means to build software<br>maybe starts to look a bit different.<br>- I think also a code<br>editor should just be fun.<br>- Yes, that is very important,<br>that is very important.<br>And it's actually an underrated aspect<br>of how we decide what to build.<br>A lot of the things that we<br>build and then we try them out,<br>we do an experiment and then<br>we actually throw them out<br>because they're not fun.<br>And so, a big part of being fun<br>is being fast a lot of the time.<br>Fast is fun.<br>- Yeah, fast is. 
(chuckles)<br>Yeah, that should be a T-shirt.<br>(group chuckling)<br>- Fundamentally, I think one of the things<br>that draws a lot of people to<br>building stuff on computers<br>is this insane iteration speed,<br>where in other disciplines<br>you might be gate capped<br>by resources or the ability.<br>Even the ability to get<br>a large group together<br>and coding is this amazing thing<br>where it's you and the<br>computer and that alone,<br>you can build really cool<br>stuff really quickly.<br>- So for people who don't know,<br>Cursor is this super cool new editor<br>that's a fork of VS Code.<br>It would be interesting<br>to get your explanation<br>of your own journey of editors.<br>I think all of you were big<br>fans of VS Code with Copilot.<br>How did you arrive to VS Code<br>and how did that lead to<br>your journey with Cursor?<br>- Yeah, so I think a lot of us,<br>well, all of us were originally Vim users.<br>- Pure Vim.<br>- Pure Vim, yeah.<br>No Neovim, just pure Vim and a terminal.<br>And at least for myself,<br>it was around the time<br>that Copilot came out,<br>so 2021 that I really wanted to try it.<br>So, I went into VS Code,<br>the only code editor in<br>which it was available,<br>and even though I really<br>enjoyed using Vim,<br>just the experience of<br>Copilot with VS Code<br>was more than good enough<br>to convince me to switch.<br>And so that kind of was the default<br>until we started working on Cursor.<br>- And maybe we should<br>explain what Copilot does.<br>It's a really nice auto complete.<br>As you start writing a thing,<br>it suggests one or two or three lines<br>how to complete the thing.<br>And there's a fun experience in that.<br>You know like when you<br>have a close friendship<br>and your friend completes your sentences?<br>(group chuckles)<br>When it's done well,<br>there's an intimate feeling.<br>There's probably a better<br>word than intimate,<br>but there's a cool feeling of<br>like, "Holy shit, it gets me."<br>(all chuckles)<br>And then, there's an unpleasant feeling<br>when it doesn't get you.<br>And so, there's that kind of friction.<br>But I would say for a lot of people,<br>the feeling that it gets me<br>overpowers that it doesn't.<br>- And, I think, actually one<br>of the underrated aspects<br>of Github Copilot is that<br>even when it's wrong,<br>it's a little bit annoying,<br>but it's not that bad<br>because you just type another character,<br>and then maybe then it gets you,<br>or you type another character<br>and then it gets you.<br>So even when it's wrong,<br>it's not that bad.<br>- Yeah, you can iterate and fix it.<br>I mean, the other underrated<br>part of Copilot for me<br>was just the first real AI product.<br>So the first language<br>model consumer product.<br>- So, Copilot was like<br>the first killer app<br>for LLMs.<br>- Yeah.<br>- Yeah, and the beta was out in 2021.<br>- Right, okay.<br>So, what's the origin story of Cursor?<br>- So around 2020,<br>the scaling loss papers<br>came out from OpenAI,<br>and that was a moment<br>where this looked like<br>clear predictable progress<br>for the field where even if<br>we didn't have any more ideas,<br>it looked like you could make<br>these models a lot better<br>if you had more compute and more data.<br>- By the way, we'll probably<br>talk for three to four hours<br>on the topic of scaling loss.<br>(group chuckling)<br>But just to summarize,<br>it's a paper in a set of<br>papers in a set of ideas<br>that say bigger might be better<br>for model size and data size<br>in the realm of machine 
learning.<br>- It's bigger and better,<br>but predictably better.<br>- Okay, that's another<br>topic of conversation.<br>- Yes.<br>- Yeah.<br>- So around that time for some of us,<br>there were a lot of<br>conceptual conversations<br>about what's this gonna look like?<br>What's the story gonna be<br>for all these different<br>knowledge worker fields<br>about how they're gonna be made better<br>by this technology getting better?<br>And then, I think, there<br>were a couple of moments<br>where the theoretical gains predicted<br>in that paper started<br>to feel really concrete,<br>and it started to feel like a moment<br>where you could actually<br>go and not do a PhD<br>if you wanted to do useful work in AI.<br>It actually felt like now<br>there was this whole set<br>of systems one could build<br>that were really useful.<br>And I think that the first moment<br>we already talked about a little bit,<br>which was playing with<br>the early beta of Copilot,<br>that was awesome and magical.<br>I think that the next big moment<br>where everything kind of clicked together<br>was actually getting<br>early access to GPT-4.<br>So, it was sort of end of 2022<br>was when we were<br>tinkering with that model,<br>and the step-upping<br>capabilities felt enormous.<br>And previous to that,<br>we had been working on a<br>couple of different projects.<br>Because of Copilot,<br>because of scaling odds,<br>because of our prior<br>interest in the technology,<br>we had been tinkering around<br>with tools for programmers,<br>but things that are very specific.<br>So, we were building tools<br>for financial professionals<br>who have to work within a Jupyter Notebook<br>or playing around with<br>can you do static analysis<br>with these models?<br>And then, the step-up in GPT-4 felt like,<br>look, that really made<br>concrete the theoretical gains<br>that we had predicted before.<br>It felt like you could build<br>a lot more just immediately<br>at that point in time.<br>And also, if we were being<br>consistent, it really felt like<br>this wasn't just gonna be<br>a point solution thing.<br>This was gonna be all of<br>programming was gonna flow<br>through these models.<br>And it felt like that<br>demanded a different type<br>of programming environment, a<br>different type of programming.<br>And so, we set off to build<br>that larger vision around then.<br>- There's one that I distinctly remember.<br>So, my roommate is an IMO Gold winner<br>and there's a competition<br>in the US called the PUTNAM,<br>which is the IMO for college people,<br>and it's this math competition.<br>It's exceptionally good.<br>So, Shengtong and Aman I<br>remember, sort of June of 2022,<br>had this bet on whether<br>the 2024 June or July,<br>you were going to win a gold<br>medal in the IMO with models.<br>- IMO is the International Math Olympiad.<br>- Yeah, IMO is<br>International Math Olympiad.<br>And so, Arvid and I are<br>both also competing in it.<br>So, it was sort of personal.<br>(group chuckling)<br>And I remember thinking, "Matt,<br>this is not gonna happen."<br>Even though I sort of<br>believed in progress,<br>I thought IMO Gold, Aman is delusional.<br>And to be honest, I mean, I<br>was, to be clear, very wrong.<br>But that was maybe the most<br>prescient bet in the group.<br>- So the new results from DeepMind,<br>it turned out that you were correct.<br>(group chattering)<br>- [Arvid] Technically not.<br>- Technically incorrect<br>but one point away.<br>- Aman was very enthusiastic<br>about this stuff back then.<br>And before, Aman 
And before, Aman had this scaling laws T-shirt that he would wear around where it had the charts and the formulas on it.
- So, you felt the AGI or you felt the scaling laws.
- Yeah, I distinctly remember there was this one conversation I had with Michael. Before that, I hadn't thought super deeply and critically about scaling laws. And he kind of posed the question, why isn't scaling all you need, or why isn't scaling gonna result in massive gains in progress? And I think I went through the stages of grief. There is anger, denial, and then finally at the end, just thinking about it, acceptance. And I think I've been quite hopeful and optimistic about progress since. I think one thing I'll caveat is, I think, it also depends on which domains you're gonna see progress. Math is a great domain, especially formal theorem proving, because you get this fantastic signal of actually verifying if the thing was correct. And so this means something like RL can work really, really well, and I think you could have systems that are perhaps very superhuman in math and still not technically have AGI.
- Okay, so can we take it all the way to Cursor? And what is Cursor? It's a fork of VS Code, and VS Code is one of the most popular editors for a long time. Everybody fell in love with it. Everybody left Vim, I left Emacs for it. Sorry. (all laughing) So, it unified in some fundamental way the developer community. And then, you look at the space of things, you look at the scaling laws, AI is becoming amazing. And you decided, okay, it's not enough to just write an extension for VS Code because there's a lot of limitations to that. If AI is gonna keep getting better and better and better, we need to really rethink how the AI is gonna be part of the editing process. And so, you decided to fork VS Code and start to build a lot of the amazing features we'll be able to talk about. But what was that decision like? Because there's a lot of extensions, including Copilot, of VS Code that are doing sort of AI type stuff. What was the decision like to just fork VS Code?
- So the decision to do an editor seemed self-evident to us for at least what we wanted to do and achieve, because when we started working on the editor, the idea was these models are gonna get much better, their capabilities are gonna improve, and it's gonna entirely change how you build software, both in that you will have big productivity gains, but also radically, in that the act of building software is gonna change a lot. And so, you're very limited in the control you have over a code editor if you're a plugin to an existing coding environment, and we didn't wanna get locked in by those limitations. We wanted to be able to just build the most useful stuff.
- Okay. Well then, the natural question is, VS Code is kind of with Copilot a competitor, so how do you win? Is it basically just the speed and the quality of the features?
- Yeah, I mean, I think this is a space that is quite interesting, perhaps quite unique, where if you look at previous tech waves, maybe there's kind of one major thing that happened and it unlocked a new wave of companies, but every single year, every single model capability or jump you get in model capabilities, you now unlock this new wave of features, things that are possible, especially in programming.
And so, I think, in AI programming, being even just a few months ahead, let alone a year ahead, makes your product much, much, much more useful. I think the Cursor a year from now will need to make the Cursor of today look obsolete. And I think Microsoft has done a number of fantastic things, but I don't think they're in a great place to really keep innovating and pushing on this in the way that a startup can.
- Just rapidly implementing features.
- Yeah, and doing the research experimentation necessary to really push the ceiling.
- I don't know if I think of it in terms of features so much as I think of it in terms of capabilities for programmers. As the new o1 model came out, and I'm sure there are gonna be more models of different types, like longer context and maybe faster, there's all these crazy ideas that you can try, and hopefully 10% of the crazy ideas will make it into something kind of cool and useful, and we want people to have that sooner. To rephrase, an underrated fact is we're making it for ourselves. When we started Cursor, you really felt this frustration that models, you could see models getting better, but the Copilot experience had not changed. It was like, "Man, the ceiling is getting higher, why are they not making new things? They should be making new things. Where's all the alpha features? There were no alpha features." I'm sure it was selling well. I'm sure it was a great business, but I'm one of these people that really want to try and use new things, and there was no new thing for a very long while.
- Yeah, it's interesting. I don't know how you put that into words, but when you compare Cursor with Copilot, Copilot pretty quickly started to feel stale for some reason.
- Yeah, I think one thing that I think helps us is that we're doing it all in one, where we're developing the UX and the way you interact with the model at the same time as we're developing how we actually make the model give better answers. So, how you build up the prompt, or how do you find the context, and for a Cursor Tab, how do you train the model? So, I think that helps us to have all of it, the same people working on the entire experience end to end.
- Yeah, it's like the person making the UI and the person training the model sit like 18 feet away.
- [Aman] Often the same person even.
- Yeah, often even the same person. You can create things that are sort of not possible if you're not talking, you're not experimenting.
- And you're using, like you said, Cursor to write Cursor?
- Of course.
- Oh, yeah.
- Yeah.
- Well, let's talk about some of these features. Let's talk about the all-knowing, the all-powerful, praise be to the Tab, (group chuckles) auto complete on steroids basically. So how does Tab work? What is Tab?
- To highlight and summarize at a high level, I'd say that there are two things that Cursor is pretty good at right now. There are other things that it does, but two things that it helps programmers with. One is this idea of looking over your shoulder, and being a really fast colleague who can kind of jump ahead of you, and type, and figure out what you're gonna do next.
The kernel of the idea behind a good auto complete was predicting what you're gonna do next, but you can make that concept even more ambitious by not just predicting the characters after your cursor but actually predicting the next entire change you're gonna make, the next diff, next place you're gonna jump to. And the second thing Cursor is pretty good at right now too is helping you sometimes jump ahead of the AI and tell it what to do and go from instructions to code. And on both of those, we've done a lot of work on making the editing experience for those things ergonomic and also making those things smart and fast.
- One of the things we really wanted was we wanted the model to be able to edit code for us. That was kind of a wish, and we had multiple attempts at it before we had a good model that could edit code for you. Then after we had a good model, I think there's been a lot of effort to make the inference fast for having a good experience, and we've been starting to incorporate, I mean, Michael sort of mentioned this ability to jump to different places, and that jump to different places I think came from a feeling of, once you accept an edit, it's like, "Man, it should be just really obvious where to go next." It's like, "I made this change, the model should just know that the next place to go to is 18 lines down." If you're a Vim user, you could press 18JJ or whatever, but why am I doing this? The model should just know it. So the idea was you just press Tab, it would go 18 lines down, and then show you the next edit and you would press Tab, so as long as you could keep pressing Tab. And so the internal competition was, how many Tabs can we make someone press? Once you have the idea, more abstractly, the thing to think about is, how are the edits zero entropy? There's no new bits of information to finish your thought, but you still have to type some characters to make the computer understand what you're actually thinking, then maybe the model should just read your mind and all the zero entropy bits should just be tabbed away. That was sort of the abstract version.
- There's this interesting thing where if you look at language model loss on different domains, I believe the bits per byte, which is a kind of character-normalized loss, for code is lower than language, which means in general, there are a lot of tokens in code that are super predictable, a lot of characters that are super predictable. And this is I think even magnified when you're not just trying to auto complete code, but predicting what the user's going to do next in their editing of existing code. And so, the goal of Cursor Tab is let's eliminate all the low entropy actions you take inside of the editor. When the intent is effectively determined, let's just jump you forward in time, skip you forward.
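The "bits per byte" number mentioned here is just cross-entropy loss normalized by the length of the underlying text rather than by token count, which makes losses comparable across domains with different tokenizations. A minimal sketch of that conversion, with made-up per-token losses standing in for a real model's output:

```python
# Sketch: convert per-token cross-entropy (nats) into bits per byte.
# The losses and text below are fabricated for illustration only.
import math

def bits_per_byte(token_losses_nats, text: str) -> float:
    """Sum of token losses, converted nats -> bits, divided by text bytes."""
    total_bits = sum(token_losses_nats) / math.log(2)
    return total_bits / len(text.encode("utf-8"))

# Code tends to score low on this metric: indentation, closing brackets and
# boilerplate tokens are nearly free to predict.
code_losses = [2.1, 0.3, 0.1, 0.05, 0.4, 0.02]   # made-up numbers
print(round(bits_per_byte(code_losses, "for i in range(n):"), 3))
```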
- Well, what's the intuition and what's the technical details of how to do next cursor prediction? That jump, that's not so intuitive I think to people.
- Yeah. I think I can speak to a few of the details on how to make these things work. They're incredibly low latency, so you need to train small models on this task. In particular, they're incredibly pre-fill token hungry. What that means is they have these really, really long prompts where they see a lot of your code and they're not actually generating that many tokens. And so, the perfect fit for that is using a sparse model, meaning an MoE model. So that was one breakthrough we made that substantially improved its performance at longer context. The other being a variant of speculative decoding that we built out called speculative edits. These are two, I think, important pieces of what make it quite high quality and very fast.
- Okay, so MoE, Mixture of Experts, the input is huge, the output is small.
- Yeah.
- Okay. Does caching play a role-
- Oh, caching plays a huge role. Because you're dealing with this many input tokens, if every single keystroke that you're typing in a given line you had to rerun the model on all of those tokens passed in, you're just going to, one, significantly degrade latency, two, you're gonna kill your GPUs with load. So, you need to design the actual prompts you use for the model such that they're caching aware. And then, yeah, you need to reuse the KV cache across requests just so that you're spending less work, less compute.
- Again, what are the things that Tab is supposed to be able to do in the near term, just to linger on that? Generate code, fill empty space, also edit code across multiple lines, and then jump to different locations inside the same file, and then-
- Hopefully, jump to different files also. So if you make an edit in one file, and maybe you have to go to another file to finish your thought, it should go to the second file also.
- The full generalization is next action prediction. Sometimes, you need to run a command in the terminal and it should be able to suggest the command based on the code that you wrote too. It suggests something, but it's hard for you to know if it's correct because you actually need some more information to learn. You need to know the type to be able to verify that it's correct. And so maybe it should actually take you to a place that's the definition of something, and then take you back so that you have all the requisite knowledge to be able to accept the next completion.
- So providing the human the knowledge.
- [Arvid] Yes.
- Right.
- Mm-hmm, yeah.
- I just got to know a guy named Primeagen. You can order coffee via SSH.
- (chuckles) Oh, yeah.
- We did that.
- We did that.
- So, can the model also do that and provide you with caffeine? Okay, so that's the general framework.
- Yeah.
- Programming is this weird discipline where sometimes the next five minutes, not always, but sometimes the next five minutes of what you're gonna do is actually predictable from the stuff you've done recently. And so, can you get to a world where that next five minutes either happens by you disengaging and it taking you through? Or maybe a little bit more of just you seeing next step what it's gonna do and you're like, "Okay, that's good, that's good, that's good, that's good," and you can just tap, tap through these big changes.
- As we're talking about this, I should mention one of the really cool and noticeable things about Cursor is that there's this whole diff interface situation going on.
So, the model suggests, with the red and the green, here's how we're gonna modify the code, and in the chat window you can apply, and it shows you the diff and you can accept the diff. So, maybe can you speak to whatever direction of that?
- We'll probably have four or five different kinds of diffs. So we have optimized the diff for the auto complete, so that has a different diff interface than when you're reviewing larger blocks of code. And then we're trying to optimize another diff thing for when you're doing multiple different files. And at a high level, the difference is, for when you're doing auto-complete, it should be really, really fast to read. Actually, it should be really fast to read in all situations, but in auto-complete your eyes are focused in one area. The humans can't look in too many different places.
- So, you're talking about on the interface side?
- On the interface side. So it currently has this box on this side. So we have the current box, and if it tries to delete code in some place and tries to add other code, it tries to show you a box on the side.
- You can maybe show it if we pull it up in cursor.com. This is what we're talking about.
- So that box-
- Exactly here.
- There were like three or four different attempts at trying to make this thing work, where first the attempt was this blue crossed out line. So before it was a box on the side, it used to show you the code to delete by showing you, Google Docs style, you would see a line through it and then you would see the new code. That was super distracting. There were deletions, there was trying the red highlight. Then the next iteration of it, which is sort of funny, you would hold the, on Mac, the Option button. So, it would highlight a region of code to show you that there might be something coming. So, maybe in this example, the input and the value would all get blue. And the blue was to highlight that the AI had a suggestion for you. So instead of directly showing you the thing, it would just hint that the AI had a suggestion, and if you really wanted to see it, you would hold the Option button, and then you would see the new suggestion. And if you release the Option button, you would then see your original code.
- Mm-hmm, by the way, that's pretty nice, but you have to know to hold the Option button.
- Yeah.
- And by the way, I'm not a Mac user, but I got it, Option. It's a button I guess you people have.
- Again, it's just not intuitive. I think that's the key thing.
- And there's a chance this is also not the final version of it.
- I am personally very excited for making a lot of improvements in this area. We often talk about it as the verification problem, where these diffs are great for small edits. For large edits, or when it's multiple files or something, it's actually a little bit prohibitive to review these diffs. So, there are a couple of different ideas here. One idea that we have is, okay, parts of the diffs are important. They have a lot of information. And then parts of the diff are just very low entropy. They're the same thing over and over again. And so maybe you can highlight the important pieces and then gray out the not so important pieces.
Or maybe you can have a model that looks at the diff and sees, oh, there's a likely bug here. I will mark this with a little red squiggly and say, "You should probably review this part of the diff." Ideas in that vein I think are exciting.
- Yeah, that's a really fascinating space of UX design engineering. So you're, basically, trying to guide the human programmer through all the things they need to read and nothing more, optimally.
- Yeah, and you want an intelligent model to do it. Currently, diff algorithms, they're just like normal algorithms. There's no intelligence. There's intelligence that went into designing the algorithm, but then you don't care if it's about this thing or this thing, and you want the model to do this.
- So, I think the general question is like, man, these models are going to get much smarter. As the models get much smarter, the changes they will be able to propose are much bigger. So as the changes get bigger and bigger and bigger, the humans have to do more and more and more verification work. You need to help them out. I don't wanna spend all my time reviewing code.
- Can you say a little more about diffs across multiple files?
- Yeah, I mean, so GitHub tries to solve this, right, with code review. When you're doing code review, you're reviewing multiple diffs across multiple files. But like Arvid said earlier, I think you can do much better than code review. Code review kind of sucks. You spend a lot of time trying to grok this code that's often quite unfamiliar to you and it often doesn't even actually catch that many bugs. And I think you can significantly improve that review experience using language models, for example, using the kinds of tricks that Arvid had described of maybe pointing you towards the regions that actually matter. I think also about the case where the code is produced by these language models and it's not produced by someone else. The code review experience is designed for both the reviewer and the person that produced the code. In the case where the person that produced the code is a language model, you don't have to care that much about their experience, and you can design the entire thing around the reviewer such that the reviewer's job is as fun, as easy, as productive as possible. I think that feels like the issue with just naively trying to make these things look like code review. I think you can be a lot more creative and push the boundary on what's possible.
- And just one idea there is, I think, ordering matters. Generally, when you review a PR, you have this list of files and you're reviewing them from top to bottom, but you actually wanna understand this part first because that came logically first, and then you want to understand the next part. And you don't want to have to figure out that yourself. You want a model to guide you through the thing.
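One way to picture the "highlight the important pieces, gray out the low-entropy ones" idea is to score each hunk of a diff by how much of it is routine. The sketch below uses a crude regex heuristic as a stand-in for the model-based scoring described above; the regex, the function name, and the example hunks are illustrative assumptions, not anything Cursor actually does:

```python
# Sketch: rank diff hunks so the most review-worthy ones surface first.
# A real system would use a model to score hunks; a regex stands in here.
import re

BORING = re.compile(r"^[+-]\s*(import |from |#|$)")   # imports, comments, blanks

def rank_hunks(hunks):
    """Return (score, hunk) pairs, most review-worthy first."""
    scored = []
    for hunk in hunks:
        changed = [l for l in hunk.splitlines() if l[:1] in "+-"]
        boring = sum(bool(BORING.match(l)) for l in changed)
        interest = (len(changed) - boring) / max(len(changed), 1)
        scored.append((interest, hunk))
    return sorted(scored, key=lambda p: p[0], reverse=True)

example = [
    "+import os\n+import sys",                       # likely skimmable
    "-    return total\n+    return total / count",  # behavior change: review
]
for score, hunk in rank_hunks(example):
    print(f"{score:.2f}\n{hunk}\n")
```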
- And is the step of creation going to be more and more natural language? Is that the goal, versus actually writing the code?
- I think sometimes. I don't think it's going to be the case that all of programming will be natural language, and the reason for that is, if I'm pair programming with Sualeh, and Sualeh is at the computer and the keyboard, and sometimes if I'm driving, I want to say to Sualeh, "Hey, implement this function," and that works. And then sometimes it's just so annoying to explain to Sualeh what I want him to do, and so I actually take over the keyboard and I show him. I write part of the example and then it makes sense, and that's the easiest way to communicate. And so, I think that's also the case for AI. Sometimes the easiest way to communicate with the AI will be to show an example, and then it goes and does the thing everywhere else. Or sometimes if you're making a website, for example, the easiest way to show to the AI what you want is not to tell it what to do but drag things around or draw things, and maybe eventually, we will get to brain-machine interfaces or whatever and it can understand what you're thinking. And so, I think natural language will have a place. I think it will definitely not be the way most people program most of the time.
- I'm really feeling the AGI with this editor. (group chuckling) It feels like there's a lot of machine learning going on underneath. Tell me about some of the ML stuff that makes it all work.
- Cursor really works via this ensemble of custom models that we've trained alongside the frontier models that are fantastic at the reasoning-intense things. And so Cursor Tab, for example, is a great example of where you can specialize this model to be even better than frontier models if you look at evals on the task we set it at. The other domain, which it's surprising that it requires custom models but it's necessary and works quite well, is in Apply. The frontier models are quite good at sketching out plans for code and generating rough sketches of the change, but actually creating diffs is quite hard for frontier models, for your training models. You try to do this with Sonnet, with o1, any frontier model, and it really messes up stupid things like counting line numbers, especially in super, super large files. And so what we've done to alleviate this is we let the model sketch out this rough code block that indicates what the change will be, and we train a model to then Apply that change to the file.
- And we should say that Apply is, the model looks at your code, it gives you a really damn good suggestion of what new things to do. And the seemingly, for humans, trivial step of combining the two, you're saying is not so trivial.
- Contrary to popular perception, it is not a deterministic algorithm.
- Yeah, I think you see shallow copies of Apply elsewhere and it just breaks most of the time, because you think you can try to do some deterministic matching, and then it fails at least 40% of the time, and that just results in a terrible product experience. I think in general, this regime of, you are going to get smarter and smarter models. So one other thing that Apply lets you do is it lets you use fewer tokens with the most intelligent models. This is both expensive in terms of latency for generating all these tokens, and cost.
So, you can give this very, very rough sketch and then have your smaller models go and implement it, because it's a much easier task to implement this very, very sketched out code. And I think that this regime will continue, where you can use smarter and smarter models to do the planning, and then maybe the implementation details can be handled by the less intelligent ones. Perhaps you'll have maybe o1, maybe it'll be even more capable models, given an even higher level plan that is recursively applied by Sonnet and then the Apply model.
- Maybe we should talk about how to make it fast, if you like. Fast is always an interesting detail.
- [Arvid] Fast is good.
- Yeah, how do you make it fast?
- Yeah, so one big component of making it fast is speculative edits. So, speculative edits are a variant of speculative decoding, and maybe it'd be helpful to briefly describe speculative decoding. With speculative decoding, what you do is you can take advantage of the fact that, most of the time, and I'll add the caveat that it would be when you're memory bound in language model generation, if you process multiple tokens at once, it is faster than generating one token at a time. So this is the same reason why if you look at tokens per second with prompt tokens versus generated tokens, it's much, much faster for prompt tokens. So what we do is, instead of using what speculative decoding normally does, which is using a really small model to predict these draft tokens that your larger model will then go in and verify, with code edits, we have a very strong prior of what the existing code will look like, and that prior is literally the same exact code. So what you can do is you can just feed chunks of the original code back into the model, and then the model will just pretty much agree most of the time that, "Okay, I'm just gonna spit this code back out." And so, you can process all of those lines in parallel, and you just do this with sufficiently many chunks. And then eventually, you'll reach a point of disagreement, where the model will now predict text that is different from the ground truth original code. It'll generate those tokens, and then we will decide, after enough tokens match the original code, to re-start speculating in chunks of code. What this actually ends up looking like is just a much faster version of normal code editing. So, it looks like a much faster version of the model rewriting all the code. So, we can use the same exact interface that we use for diffs, but it will just stream down a lot faster.
- And then the advantage is that while it's streaming, you can just also start reviewing the code before it's done, so there's no big loading screen. Maybe that is part of the advantage.
- So, the human can start reading before the thing is done.
- I think the interesting riff here is something like, I feel like speculation is a fairly common idea nowadays. It's not only in language models. There's obviously speculation in CPUs, and there's speculation for databases, and there's speculation all over the place.
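A toy sketch of the speculative edits loop described above: the original code serves as the draft, the model verifies whole chunks at once, and on disagreement it falls back to token-by-token decoding before drafting resumes. The ToyModel, the realignment rule (which only handles simple substitutions), and all names here are illustrative assumptions, not Cursor's implementation:

```python
def speculative_edit(original, model, chunk=4):
    """Rewrite `original` (a token list) using the original code as the draft."""
    out, i = [], 0
    while i < len(original):
        draft = original[i:i + chunk]
        accepted = model.verify(out, draft)   # one parallel verification pass
        out.extend(accepted)
        i += len(accepted)
        if len(accepted) < len(draft):
            # Disagreement: decode one token the normal (serial) way, then
            # resume drafting. Advancing `i` by one assumes a substitution;
            # real realignment must also handle insertions and deletions.
            out.append(model.next_token(out))
            i += 1
    return out

class ToyModel:
    """Stand-in for an LLM: it 'wants' to emit `target` (here, a rename)."""
    def __init__(self, target):
        self.target = target
    def verify(self, prefix, draft):
        """Return the longest prefix of `draft` that matches the target."""
        want = self.target[len(prefix):len(prefix) + len(draft)]
        n = 0
        while n < len(draft) and n < len(want) and draft[n] == want[n]:
            n += 1
        return draft[:n]
    def next_token(self, prefix):
        return self.target[len(prefix)]

code = "def f ( foo ) : return foo + 1".split()
edit = [t if t != "foo" else "bar" for t in code]   # the model's intended edit
print(speculative_edit(code, ToyModel(edit)))       # streams out the rename
```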
- Well, let me ask the ridiculous question of which LLM is better at coding? GPT, Claude, who wins in the context of programming? And I'm sure the answer is much more nuanced because it sounds like every single part of this involves a different model.
- Yeah, I think there's no model that Pareto dominates others, meaning it is better in all categories that we think matter, the categories being speed, ability to edit code, ability to process lots of code, long context, a couple of other things, and coding capabilities. The one that I'd say right now is just net best is Sonnet. I think this is a consensus opinion. o1's really interesting and it's really good at reasoning. So if you give it really hard programming interview style problems or LeetCode problems, it can do quite well on them, but it doesn't feel like it understands your rough intent as well as Sonnet does. If you look at a lot of the other frontier models, one qualm I have is it feels like they're not necessarily over, I'm not saying they train on benchmarks, but they perform really well on benchmarks relative to everything that's in the middle. So if you tried it on all these benchmarks and things that are in the distribution of the benchmarks they're evaluated on, they'll do really well. But when you push them a little bit outside of that, Sonnet is, I think, the one that does best at maintaining that same capability. You have the same capability in the benchmark as when you try to instruct it to do anything with coding.
- Another ridiculous question is the difference between the normal programming experience versus what benchmarks represent. Where do benchmarks fall short, do you think, when we're evaluating these models?
- By the way, that's a really, really hard, critically important detail of how different benchmarks are versus real coding, where real coding, it's not interview style coding. Humans are saying half-broken English sometimes, and sometimes you're saying, "Oh, do what I did before." Sometimes you're saying, "Go add this thing and then do this other thing for me and then make this UI element." And then, it's just a lot of things are context dependent. You really want to understand the human and then do what the human wants, as opposed to this, maybe the way to put it abstractly is, the interview problems are very well specified. They lean a lot on specification, while the human stuff is less specified.
- Yeah. I think that this benchmark question is both complicated by what Sualeh just mentioned, and then also, what Aman was getting into, there's this problem of the skew between what can you actually model in a benchmark versus real programming, and that can be sometimes hard to encapsulate, because real programming is very messy and sometimes things aren't super well specified, what's correct or what isn't. But then it's also doubly hard because of this public benchmark problem. And that's both because public benchmarks are sometimes hill climbed on, and then it's really, really hard to also get the data from the public benchmarks out of the models. And so, for instance, one of the most popular agent benchmarks, SWE-Bench, is really, really contaminated in the training data of these foundation models. And so if you ask these foundation models to do a SWE-Bench problem, but you actually don't give them the context of a code base, they can hallucinate the right file paths, they can hallucinate the right function names.
And so, it's also just, the public aspect of these things is tricky.
- Yeah, in that case, it could be trained on the literal issues or pull requests themselves, and maybe the labs will start to do a better job, or they've already done a good job, at decontaminating those things, but they're not going to omit the actual training data of the repository itself. These are all some of the most popular Python repositories. SymPy is one example. I don't think they're going to handicap their models on SymPy and all these popular Python repositories in order to get true evaluation scores in these benchmarks.
- I think that given the dirtiness of benchmarks, there have been a few interesting crutches that places that build systems with these models, or build these models, actually use to get a sense of, are they going the right direction or not. And in a lot of places, people will actually just have humans play with the things and give qualitative feedback on these. One or two of the foundation model companies, they have people where that's a big part of their role. And internally, we also qualitatively assess these models and actually lean on that a lot, in addition to private evals that we have.
- [Arvid] It's like the vibe.
- The vibe, yeah, the vibe.
- It's like the vibe.
- The vibe benchmark, human benchmark, the humans. You pull in the humans to do a vibe check.
- Yeah. (chuckles)
- Okay. That's what I do, just reading online forums and Reddit and X. Well, I don't know how to properly load in people's opinions 'cause they'll say things like, "I feel like Claude or GPT has gotten dumber," or something. They'll say, "I feel like." And then I sometimes feel like that too, but I wonder if it's the model's problem or mine.
- With Claude, there's an interesting take I heard where I think AWS has different chips, and I suspect they have slightly different numerics than Nvidia GPUs, and someone speculated that Claude's degraded performance had to do with maybe using the quantized version that existed on AWS Bedrock versus whatever was running on Anthropic's GPUs.
- I interview a bunch of people that have conspiracy theories, so I'm glad you spoke to this conspiracy.
- Well, it's not like a conspiracy theory as much as humans. Humans are humans and there's these details.
- [Lex] Yes.
- And you're doing this crazy amount of flops, and chips are messy, and man, you can just have bugs. It's hard to overstate how hard bugs are to avoid.
- What's the role of a good prompt in all of this? We mentioned that benchmarks have really structured, well-formulated prompts. What should a human be doing to maximize success, and what's the importance of what the humans do? You wrote a blog post, you called it Prompt Design.
- Yeah, I think it depends on which model you're using, and all of them are slightly different and they respond differently to different prompts, but I think the original GPT-4 and the original (indistinct) models last year, they were quite sensitive to the prompts, and they also had a very small context window. And so, we have all of these pieces of information around the code base that would maybe be relevant in the prompt.
You have the docs, you have the files that you add, you have the conversation history, and then there's a problem of how do you decide what you actually put in the prompt, when you have limited space? And even for today's models, even when you have long context, filling out the entire context window means that it's slower. It means that sometimes the model actually gets confused, and some models get more confused than others. And we have this one system internally that we call Preempt, which helps us with that a little bit. And I think it was built for the era before, where we had 8,000 token context windows. And it's a little bit similar to when you're making a website. You want it to work on mobile, you want it to work on a desktop screen, and you have this dynamic information, which you don't have, for example, if you're designing a print magazine, where you know exactly where you can put stuff. But when you have a website, or when you have a prompt, you have these inputs, and then you need to format them to always work. Even if the input is really big, then you might have to cut something down. And so the idea was, okay, let's take some inspiration. What's the best way to design websites? Well, the thing that we really like is React and the declarative approach, where you use JSX in JavaScript, and then you declare, "This is what I want, and I think this has higher priority or this has higher Z index than something else." And then, you have this rendering engine. In web design, it's like Chrome, and in our case it's a Preempt renderer, which then fits everything onto the page. So you declare what you want, and then it figures out how to fit it. And so, we have found that to be quite helpful, and I think the role of it has shifted over time, where initially it was to fit to these small context windows. Now, it's really useful because it helps us with splitting up the data that goes into the prompt and the actual rendering of it. And so, it's easier to debug, because you can change the rendering of the prompt and then try it on old prompts, because you have the raw data that went into the prompt, and then you can see, "Did my change actually improve it for this entire eval set?"
- So, do you literally prompt with JSX?
- Yes, yes.
- Yeah.
- So it looks like React, there are components. We have one component that's a file component, and it takes in the cursor. Usually, there's one line where the cursor is in your file, and that's probably the most important line because that's the one you're looking at. And so, then you can give priorities, so that line has the highest priority, and then you subtract one for every line that is farther away. And then eventually, when it's rendered, it figures out how many lines can actually fit, and it centers around that thing.
- That's amazing.
- Yeah.
- And you can do other fancy things where if you have lots of code blocks from the entire code base, you could use retrieval and things like embedding and re-ranking scores to add priorities for you through these components.
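A rough sketch of the priority-based rendering just described, in Python rather than the JSX components the team actually uses: the line under the cursor gets the top score, priority decays with distance, and the renderer keeps whatever fits the budget, re-emitted in file order so the result stays centered on the cursor. The function name, the word-count stand-in for tokens, and the numbers are assumptions for illustration:

```python
# Sketch: a "file component" that keeps the highest-priority lines around the
# cursor, dropping far-away lines when the token budget runs out.
def render_file_component(lines, cursor_line, budget):
    """Keep the best-scoring lines that fit in `budget` pseudo-tokens."""
    # Cursor line scores highest; each line loses a point per line of distance.
    scored = [(-abs(i - cursor_line), i, text) for i, text in enumerate(lines)]
    scored.sort(reverse=True)                 # closest-to-cursor lines first
    kept, used = set(), 0
    for _, i, text in scored:
        cost = len(text.split())              # crude token-count stand-in
        if used + cost > budget:
            continue
        kept.add(i)
        used += cost
    # Re-emit survivors in file order so the prompt reads like the file.
    return "\n".join(lines[i] for i in sorted(kept))

if __name__ == "__main__":
    fake_file = [f"line {i}: x = {i}" for i in range(200)]
    print(render_file_component(fake_file, cursor_line=120, budget=60))
```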
- So should humans, when they ask questions, also try to use something like that? Would it be beneficial to write JSX in the prompt, or is the whole idea that it should be loose and messy?
- I think our goal is that you should just do whatever is the most natural thing for you, and then our job is to figure out how do we actually retrieve the relevant things so that your thinking actually makes sense.
- Well, this is the discussion I had with Aravind of Perplexity. His whole idea is you should let the person be as lazy as he wants.
- Yeah.
- Mm-hmm.
- Yeah, that's a beautiful thing, but I feel like you're allowed to ask more of programmers, right?
- Yes.
- So if you say, "Just do what you want," I mean, humans are lazy. There's a tension between just being lazy versus providing more, being prompted, almost like the system pressuring you or inspiring you to be articulate. Not in terms of the grammar of the sentences, but in terms of the depth of thoughts that you convey inside the prompts.
- I think even as a system gets closer to some level of perfection, often when you ask the model for something, not enough intent is conveyed to know what to do. And there are a few ways to resolve that intent. One is the simple thing of having the model just ask you, "I'm not sure how to do these parts based on your query. Could you clarify that?" I think the other could be, maybe if there are five or six possible generations, "Given the uncertainty present in your query so far, why don't we just actually show you all of those and let you pick them?"
- How hard is it for the model to choose to talk back? It's hard. How do you deal with the uncertainty? Do I choose to ask for more information to reduce the ambiguity?
- So, I mean, one of the things we do, it's like a recent addition, is try to suggest files that you can add. And while you're typing, one can guess what the uncertainty is and maybe suggest that, maybe you're writing your API, and we can guess, using the commits that you've made previously in the same file, that the client and the server are super useful, and there's a hard technical problem of how do you resolve it across all commits. Which files are the most important given your current prompt? Our initial version is rolled out, and I'm sure we can make it much more accurate. It's very experimental, but then the idea is we show you, do you just want to add this file, this file, this file also, to tell the model to edit those files for you? Because maybe if you're making the API, you should also edit the client and the server that is using the API, and the other one resolving the API. So that would be cool. There's both the phase where you're writing a prompt, and before you even click "Enter," maybe we can help resolve some of the uncertainty.
- To what degree do you use agentic approaches? How useful are agents?
- We think agents are really, really cool.
- [Lex] (chuckles) Okay.
- I think agents, it's like it resembles a human. You can feel that you're getting closer to AGI, because you see a demo where it acts as a human would, and it's really, really cool. I think agents are not yet super useful for many things. I think we're getting close to where they will actually be useful. And so, I think there are certain types of tasks where having an agent would be really nice. I would love to have an agent.
For example, if we have a bug where you sometimes can't Command+C and Command+V inside our chat input box, and that's a task that's super well specified, I just want to say in two sentences, "This does not work, please fix it." And then I would love to have an agent that just goes off, does it, and then a day later, I come back and I review the thing.
- You mean it goes, finds the right file?
- Yeah, it finds the right files, it tries to reproduce the bug, it fixes the bug, and then it verifies that it's correct. And this could be a process that takes a long time. And so, I think I would love to have that. And then I think a lot of programming, there is often this belief that agents will take over all of programming. I don't think we think that that's the case, because a lot of programming, a lot of the value is in iterating, or you don't actually want to specify something upfront, because you don't really know what you want until you have seen an initial version, and then you want to iterate on that, and then you provide more information. And so, for a lot of programming, I think you actually want a system that's instant, that gives you an initial version instantly back, and then you can iterate super, super quickly.
- What about something like Replit Agent, which recently came out, that also does setting up the development environment, resolving software packages, configuring everything, configuring the databases, and actually deploying the app? Is that also in the set of things you dream about?
- I think so. I think that would be really cool. For certain types of programming, it would be really cool.
- Is that within scope of Cursor?
- Yeah, we aren't actively working on it right now, but we want to make the programmer's life easier and more fun, and some things are just really tedious and you need to go through a bunch of steps, and you want to delegate that to an agent. And then some things, you can actually have an agent in the background while you're working. Let's say you have a PR that's both backend and frontend, and you're working in the frontend, and then you can have a background agent that does some work and figures out what you're doing. And then, when you get to the backend part of your PR, then you have some initial piece of code that you can iterate on. And so that would also be really cool.
- One of the things we already talked about is speed, but I wonder if we can just linger on that some more and the various technical details involved in making this thing really fast. So every single aspect of Cursor, most aspects of Cursor, feel really fast. Like I mentioned, the Apply is probably the slowest thing. I'm sorry, the pain on Arvid's face as I say that.
- I know. It's a pain, it's a pain that we're feeling and we're working on fixing it. (Arvid and Lex chuckling)
- Yeah, something that takes, I don't know what it is, like one second or two seconds, that feels slow. That actually shows that everything else is just really, really fast. So, is there some technical details about how to make some of these models, how to make the chat fast, how to make the diffs fast? Is there something that just jumps to mind?
- Yeah. So, we can go over a lot of the strategies that we use.
One interesting thing is cache warming. You're probably going to use some piece of context, and you can know that before the user's done typing. So as we discussed before, reusing the KV cache results in lower latency, lower costs, across requests. So as the user starts typing, you can immediately warm the cache with, let's say, the current file contents, and then when they press Enter, there's very few tokens it actually has to pre-fill and compute before starting the generation. This will significantly lower TTFT.
- Can you explain how KV cache works?
- [Aman] Yeah, so the way transformers work. (group chuckling)
- I like it. (group chuckling)
- The mechanism that allows transformers to not just independently look at each token, but see previous tokens, are the keys and values to attention. And generally, the way attention works is, you have at your current token some query, and then you have all the keys and values of all your previous tokens, which are some kind of representation that the model stores internally of all the previous tokens in the prompt. And by default, when you're doing a chat, the model has to, for every single token, do this forward pass through the entire model. That's a lot of matrix multiplies that happen, and that is really, really slow. Instead, if you have already done that and you stored the keys and values and you keep that in the GPU, let's say I have stored them for the last N tokens, if I now wanna compute the output token for the N+1th token, I don't need to pass those first N tokens through the entire model, because I already have all those keys and values. And so, you just need to do the forward pass through that last token. And then when you're doing attention, you're reusing those keys and values that have been computed, which is the only kind of sequential part, or sequentially dependent part, of the transformer.
- Is there higher level caching, caching of the prompts or that kind of stuff, that could help?
- I see. Yeah, there's other types of caching you can do. One interesting thing that you can do for Cursor Tab is you can basically predict ahead, as if the user would've accepted the suggestion, and then trigger another request. And so then you've cached, you've done the speculative. It's a mix of speculation and caching, right? Because you're speculating what would happen if they accepted it. And then you have this value that is cached, this suggestion. And then when they press Tab, the next one would be waiting for them immediately. It's a clever heuristic/trick that uses a higher level caching. It feels fast despite there not actually being any changes in the model.
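A single-head, numpy-only sketch of the KV cache mechanism Aman describes: each decoded token appends its key and value to a cache, and the next step only projects the newest token and attends over the stored rows instead of re-running the whole prefix. The shapes and random "model" weights are illustrative assumptions; a real transformer has many layers and heads, but the caching idea is the same:

```python
# Sketch: incremental decoding with a KV cache for one attention head.
import numpy as np

d = 8                                          # head dimension
rng = np.random.default_rng(0)
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))

def attend(q, K, V):
    """Scaled dot-product attention for a single query vector."""
    scores = K @ q / np.sqrt(d)                # one score per cached position
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return weights @ V

K_cache = np.empty((0, d))                     # grows by one row per token
V_cache = np.empty((0, d))

def decode_step(x):
    """x: embedding of the newest token. Only this token gets projected."""
    global K_cache, V_cache
    q, k, v = Wq @ x, Wk @ x, Wv @ x
    K_cache = np.vstack([K_cache, k])          # reused on every later step
    V_cache = np.vstack([V_cache, v])
    return attend(q, K_cache, V_cache)

for _ in range(5):                             # feed 5 toy tokens one at a time
    out = decode_step(rng.normal(size=d))
print("cached keys:", K_cache.shape)           # (5, 8): one K/V row per token
```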
- And if you can make the KV cache smaller, one of the advantages you get is, like, maybe you can speculate even more. Maybe you can guess, "Here's the 10 things that could be useful, predict the next 10," and then it's possible the user hits one of the 10. It's a much higher chance than the user hitting the exact one that you showed them. Maybe they type in another character and hit something else in the cache. The general phenomenon here, and I think it's also super useful for RL, is, maybe a single sample from the model isn't very good, but if you predict 10 different things, it turns out that the probability that one of the 10 is right is much higher. There are these pass@k curves, and part of what RL does is you can exploit this pass@k phenomenon to make many different predictions. And one way to think about this, the model internally has some uncertainty over which of the k things is correct, or which of the k things does the human want. When we RL our Cursor Tab model, one of the things we're doing is we're predicting which of the 100 different suggestions the model produces is more amenable for humans. Which of them do humans like more than other things? Maybe there's something where the model can predict very far ahead versus a little bit, maybe somewhere in the middle. And then you can give a reward to the things that humans would like more and punish the things that humans would like less, and then train the model to output the suggestions that humans would like more. You have these RL loops that are very useful that exploit these pass@k curves. Aman maybe can go into even more detail.
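For reference, the pass@k curves mentioned here quantify how much better "one of k samples is right" is than a single sample. A small sketch using the standard unbiased estimator common in code-generation evals; the sample counts below are made up:

```python
# Sketch: estimate pass@k from n samples of which c were correct.
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Probability that at least one of k drawn samples is correct."""
    if n - c < k:
        return 1.0                     # every size-k subset contains a hit
    return 1.0 - comb(n - c, k) / comb(n, k)

# e.g. 100 samples, 12 correct: one draw rarely works, ten draws usually do.
print(round(pass_at_k(100, 12, 1), 3))    # 0.12
print(round(pass_at_k(100, 12, 10), 3))   # ~0.74
```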
- Yeah, it is a little different than speed, but technically, you tie it back in, because you can get away with the smaller model if you RL your smaller model and it gets the same performance as the bigger one. So while I was mentioning stuff about KV, about reducing the size of your KV cache, there are other techniques there as well that are really helpful for speed. So, kind of back in the day, all the way two years ago, people mainly used multi-head attention, and I think there's been a migration towards more efficient attention schemes like group query or multi-query attention, and this is really helpful, with larger batch sizes, for being able to generate the tokens much faster. The interesting thing here is this now has no effect on that time to first token pre-fill speed. The thing this matters for is now generating tokens. And why is that? 'Cause when you're generating tokens, instead of being bottlenecked by doing these super parallelizable matrix multiplies across all your tokens, you're bottlenecked, for long context with large batch sizes, by how quickly you can read those cached keys and values. And so then that's memory bandwidth, and how can we make this faster? We can try to compress the size of these keys and values. So multi-query attention is the most aggressive of these. Where normally with multi-head attention, you have some number of, quote, unquote, "attention heads," and some number of query heads, multi-query just preserves the query heads and gets rid of all the key-value heads. So there's only one kind of key-value head, and there's all the remaining query heads. With group query, you instead preserve all the query heads, and there are fewer heads for the keys and values, but you're not reducing it to just one. But anyways, the whole point here is you're just reducing the size of your KV cache.
- And then there is MLA.
- Yeah, multi-latent. That's a little more complicated. And the way that this works is it kind of turns the entirety of your keys and values across all your heads into this one latent vector that is then kind of expanded back out at inference time.
- But MLA is from this company called DeepSeek. It's quite an interesting algorithm. Maybe the key idea is, in both MQA and in other places, what you're doing is you're reducing the number of KV heads. And the advantage you get from that is there's less of them, but you want each of the keys and values to actually be different. So, one way to reduce the size is you keep one big shared vector for all the keys and values, and then you have smaller vectors for every single token, so that you can store only the smaller thing, as some sort of low-rank reduction. And then at the end, when you eventually wanna compute the final thing, remember that you're memory bound, which means that you still have some compute left that you can use for these things. So if you can expand the latent vector back out, somehow this is far more efficient, because you're reducing, for example, maybe by 32 times or something, the size of the vector that you're keeping.
- Yeah, there's perhaps some richness in having a separate set of keys and values and query that kind of pairwise match up, versus compressing that all into one, in that interaction at least.
- Okay, and all of that is dealing with being memory bound.
- Yeah.
- I mean, ultimately, how does that map to the user experience? Trying to get the-
- Yeah, the two things that it maps to is, you can now make your cache a lot larger, because you have less space allocated for the KV cache. You can maybe cache a lot more aggressively, in a lot more things, so you get more cache hits, which are helpful for reducing the time to first token, for the reasons that were kind of described earlier. And then the second being, when you start doing inference with more and more requests and larger and larger batch sizes, you don't see much of a slowdown in how fast it's generating the tokens.
- Well, it also allows you to make your prompt bigger for certain-
- Yeah, yeah. So, the size of your KV cache is both the size of all your prompts multiplied by the number of prompts being processed in parallel. So you could increase either of those dimensions, right? The batch size or the size of your prompts, without degrading the latency of generating tokens.
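A back-of-the-envelope sketch of why fewer key-value heads matter when generation is memory-bandwidth bound: the bytes of cached keys and values that have to be streamed for every decoded token scale directly with the number of KV heads. The configuration numbers below are illustrative assumptions, not any particular model's:

```python
# Sketch: KV cache size under multi-head vs. grouped-query vs. multi-query
# attention, with made-up but typical-looking numbers.
def kv_cache_bytes(layers, kv_heads, head_dim, seq_len, batch, bytes_per=2):
    # 2x for keys plus values; bytes_per=2 assumes fp16 storage.
    return 2 * layers * kv_heads * head_dim * seq_len * batch * bytes_per

cfg = dict(layers=32, head_dim=128, seq_len=16_384, batch=8)
mha = kv_cache_bytes(kv_heads=32, **cfg)   # multi-head: one K/V pair per head
gqa = kv_cache_bytes(kv_heads=8,  **cfg)   # grouped-query: heads share K/V
mqa = kv_cache_bytes(kv_heads=1,  **cfg)   # multi-query: a single K/V head

for name, b in [("MHA", mha), ("GQA", gqa), ("MQA", mqa)]:
    print(f"{name}: {b / 2**30:.1f} GiB of keys/values to stream per step")
```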
- Arvid, you wrote a blog post, "Shadow Workspace: Iterating on Code in the Background." So, what's going on?
- So, to be clear, we want there to be a lot of stuff happening in the background, and we're experimenting with a lot of things. Right now, we don't have much stuff happening other than the cache warming, or figuring out the right context that goes into your command key prompts, for example. But the idea is, if you can actually spend computation in the background, then you can help the user maybe at a slightly longer time horizon than just predicting the next few lines that you're gonna make. But actually, in the next 10 minutes, what are you going to make? And by doing it in the background, you can spend more computation doing that. And so the idea of the Shadow Workspace that we implemented, and we use it internally for experiments, is that to actually get advantage of doing stuff in the background, you want some kind of feedback signal to give back to the model, because otherwise, you can get higher performance by just letting the model think for longer, and so o1 is a good example of that. But another way you can improve performance is by letting the model iterate and get feedback. And so, one very important piece of feedback when you're a programmer is the language server, which is this thing that exists for most different languages, and there's a separate language server per language. And it can tell you, "You're using the wrong type here," and then gives you an error, or it can allow you to go to definition, and it understands the structure of your code. There is a TypeScript language server developed by the TypeScript people, a Rust language server developed by the Rust people, and then they all interface over the Language Server Protocol to VS Code, so that VS Code doesn't need to have all of the different languages built into VS Code, but rather you can use the existing compiler infrastructure.
- For linting purposes, what-
- It's for linting. It's for going to definition, and for seeing the right types that you're using.
- So it's doing type checking also?
- Yes, type checking and going to references. And that's like, when you're working in a big project, you kind of need that. If you don't have that, it's really hard to code in a big project.
- Can you say, again, how that's being used inside Cursor, the Language Server Protocol communication thing?
- So it's being used in Cursor to show to the programmer, just like in VS Code, but then the idea is you want to show that same information to the models, the AI models, and you want to do that in a way that doesn't affect the user, because you want to do it in the background. And so the idea behind the Shadow Workspace was, okay, one way we can do this is we spawn a separate window of Cursor that's hidden, so you can set this flag and then it's hidden. There is a window but you don't actually see it. And inside of this window, the AI agents can modify code however they want, as long as they don't save it, because it's still the same folder, and then they can get feedback from the linters and go to definition and iterate on their code.
- So literally run everything in the background, right, maybe even run the code.
- So that's the eventual version, and that's what you want. And a lot of the blog post is actually about how do you make that happen, because it's a little bit tricky. You want it to be on the user's machine so that it exactly mirrors the user's environment. And then on Linux, you can do this cool thing where you can actually mirror the file system and have the AI make changes to the files, and it thinks that it's operating on the file level, but actually, that's stored in memory, and you can create this kernel-like extension to make it work. Whereas on Mac and Windows, it's a little bit more difficult, but it's a fun technical problem, so that's why.
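A minimal sketch of the iterate-against-the-language-server loop the Shadow Workspace enables: the model revises in-memory copies of the files, a language server or linter reports diagnostics, and nothing is ever written to disk. Both callables here are stand-ins, `propose_edit` for an LLM call and `get_diagnostics` for a query over LSP; neither name is Cursor's actual API:

```python
# Sketch: let a model iterate on hidden, unsaved copies of files using
# linter/language-server feedback, without touching the user's saved files.
from typing import Callable, Dict, List

def shadow_iterate(files: Dict[str, str],
                   propose_edit: Callable[[Dict[str, str], List[str]], Dict[str, str]],
                   get_diagnostics: Callable[[Dict[str, str]], List[str]],
                   max_rounds: int = 5) -> Dict[str, str]:
    """Revise in-memory file copies until diagnostics come back clean."""
    workspace = dict(files)                  # in-memory copy, never saved
    errors: List[str] = []
    for _ in range(max_rounds):
        workspace = propose_edit(workspace, errors)   # model's next attempt
        errors = get_diagnostics(workspace)           # feedback signal
        if not errors:
            break                                     # clean: ready to surface
    return workspace
```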
- Can you say, again, how that's being used inside Cursor, the language server protocol communication thing?
- So it's being used in Cursor to show to the programmer, just like in VS Code, but then the idea is you want to show that same information to the models, the AI models, and you want to do that in a way that doesn't affect the user, because you want to do it in the background. And so the idea behind the Shadow Workspace was, okay, one way we can do this is we spawn a separate window of Cursor that's hidden, so you can set this flag and it's hidden. There is a window, but you don't actually see it. And inside of this window, the AI agents can modify code however they want, as long as they don't save it, because it's still the same folder, and then they can get feedback from the linters and go to definition and iterate on their code.
- So literally run everything in the background, right, maybe even run the code.
- So that's the eventual version and that's what you want. And a lot of the blog post is actually about how do you make that happen, because it's a little bit tricky. You want it to be on the user's machine so that it exactly mirrors the user's environment. And then on Linux, you can do this cool thing where you can actually mirror the file system and have the AI make changes to the files, and it thinks that it's operating on the file level, but actually, that's stored in memory, and you can create this kernel-like extension to make it work. Whereas on Mac and Windows, it's a little bit more difficult, but it's a fun technical problem, so that's why.
- One maybe hacky but interesting idea that I like is holding a lock on saving. And so basically, you can then have the language model kind of hold the lock on saving to disk, and then instead of operating in the ground truth version of the files that are saved to disk, you actually are operating in what was the Shadow Workspace before, and these unsaved things that only exist in memory that you still get linter errors for, and that you can code in. And then when you try to maybe run code, there's just a small warning that there's a lock, and then you kind of will take back the lock from the language server if you're trying to do things concurrently, or from the Shadow Workspace if you're trying to do things concurrently.
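As a toy illustration of that lock-on-saving idea (this is not how Cursor implements it, just a sketch of the concept): the agent edits in-memory copies of files while holding a save lock, and the user's save path reclaims the lock before anything touches disk.

```python
import threading

class ShadowBuffer:
    """Toy sketch: the agent works on in-memory copies while holding
    save_lock, so the on-disk ground truth stays untouched."""

    def __init__(self):
        self.save_lock = threading.Lock()
        self.unsaved: dict = {}  # path -> in-memory contents

    def agent_edit(self, path: str, new_contents: str) -> None:
        # The agent only ever touches the in-memory overlay; linters and
        # type checkers would be pointed at self.unsaved here.
        with self.save_lock:
            self.unsaved[path] = new_contents

    def user_save(self, path: str, contents: str) -> None:
        # The user "takes back" the lock before writing to disk, and the
        # shadow copy for that file is discarded.
        with self.save_lock:
            self.unsaved.pop(path, None)
            with open(path, "w") as f:
                f.write(contents)
```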
- That's such an exciting future, by the way. It's a bit of a tangent, but to allow a model to change files, it's scary for people, but it's really cool, to be able to just let the agent do a set of tasks and you come back the next day and kind of observe, like it's a colleague or something like that.
- And I think there may be different versions of runnability. For the simple things, where you're doing things in the span of a few minutes on behalf of the user as they're programming, it makes sense to make something work locally on their machine. I think for the more aggressive things, where you're making larger changes that take longer periods of time, you'll probably wanna do this in some sandboxed remote environment, and that's another incredibly tricky problem of how do you exactly reproduce, or mostly reproduce to the point of it being effectively equivalent for running code, the user's environment with this remote sandbox.
- I'm curious what kind of agents you want for coding? Do you want them to find bugs? Do you want them to implement new features? What agents do you want?
- So by the way, when I think about agents, I don't think just about coding. I think so for this particular podcast, there's video editing, and if you look in Adobe, there's code behind it. It's very poorly documented code, but you can interact with Premiere, for example, using code, and basically all the uploading, everything I do on YouTube, everything, as you could probably imagine, I do all of that through code, including translation and overdubbing, all of this. So, I envision all of those kinds of tasks, automating many of the tasks that don't have to do directly with the editing. Okay, that's what I was thinking about. But in terms of coding, I would be fundamentally thinking about bug finding, many levels of kind of bug finding, and also bug finding like logical bugs, not logical like spiritual bugs or something. (group chuckling) Ones like big directions of implementation, that kind of stuff.
- Magical (indistinct) and bug finding.
- Yeah, I mean, it's really interesting that these models are so bad at bug finding when just naively prompted to find a bug. They're incredibly poorly calibrated.
- Even the smartest models.
- Exactly, even o1.
- How do you explain that? Is there a good intuition?
- I think these models are a really strong reflection of the pre-training distribution, and I do think they generalize as the loss gets lower and lower, but I don't think the loss is low enough such that they're really fully generalizing on code. The things that we use these things for, that the frontier models are quite good at, are really code generation and question answering. And these things exist in massive quantities in pre-training, with all of the code on GitHub on the scale of many, many trillions of tokens, and questions and answers on things like Stack Overflow and maybe GitHub issues. And so, when you try to push one of these things that really don't exist very much online, for example, the Cursor Tab objective of predicting the next edit given the edits done so far, the brittleness kind of shows. And then bug detection is another great example, where there aren't really that many examples of actually detecting real bugs and then proposing fixes, and the models just kind of really struggle at it. But I think it's a question of transferring the model. In the same way that you get this fantastic transfer from pre-trained models just on code in general to the Cursor Tab objective, you'll see a very, very similar thing with generalized models that are really good at code to bug detection. It just takes a little bit of kind of nudging in that direction.
- Look, to be clear, I think they understand code really well. While they're being pre-trained, the representation that's being built up, almost certainly somewhere in the stream, the model knows that maybe there's something sketchy going on. Part of it is that humans are really calibrated on which bugs are really important. It's not just actually saying there's something sketchy. It's, is this sketchy-trivial, or is this sketchy-you're-gonna-take-the-server-down. Part of it is maybe the cultural knowledge of why a staff engineer is good, because they know that three years ago someone wrote a really sketchy piece of code that took the server down. (group chuckling) This thing is an experiment. So, a few bugs are fine, you're just trying to experiment and get the feel of the thing. And so if the model gets really annoying when you're writing an experiment, that's really bad, but if you're writing something for super production, you're writing a database, you're writing code in Postgres or Linux or whatever, you're Linus Torvalds, it's sort of unacceptable to have even an edge case. So it's about having the calibration of how paranoid the user is.
- But even then, if you're putting in maximum paranoia, it still just doesn't quite get it.
- Yeah, yeah, yeah.
- I mean, but this is hard for humans too, to understand which line of code is important, which is not. I think one of your principles on the website says if a piece of code can do a lot of damage, one should add a comment that says, "This line of code is dangerous."
- And all caps, repeated 10 times. (group chuckling)
- No, you say for every single line of code inside the function you have to, and that's quite profound. It says something about human beings, because the engineers move on, and even the same person might just forget how a single function can sink the Titanic. You might not intuit that quite clearly by looking at the single piece of code.
- Yeah, and I think that one is partially also for today's AI models, where if you actually write dangerous, dangerous, dangerous in every single line, the models will pay more attention to that and will be more likely to find bugs in that region.
- That's actually just straight up a really good practice of labeling code for how much damage it can do.
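For illustration, the convention being described looks something like this (the function and its comments are hypothetical):

```python
def apply_refund(order_id: str, amount_cents: int) -> None:
    # DANGEROUS: this touches real customer balances.
    # DANGEROUS: a flipped sign here charges instead of refunding.
    # DANGEROUS: called from the nightly reconciliation job; keep it idempotent.
    ...
```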
- Yeah, I mean, it's controversial. Some people think it's ugly. Sualeh does not like it.
- In fact, I actually think this is one of the things I learned from Arvid. Aesthetically, I don't like it, but I think there's certainly something where it's useful for the models, and humans just forget a lot, and it's really easy to make a small mistake and just bring down the server. Of course, we test a lot and whatever, but there's always these things that you have to be very careful about.
- Yeah, like with just normal docstrings, I think people will often just skim it when making a change and think, "Oh, I know how to do this," and you really need to point it out to them so that doesn't slip through.
- Yeah, you have to be reminded that you could do a lot of damage. We don't really think about that. You think about, "Okay, how do I figure out how this works so I can improve it?" You don't think about the other direction, that it could-
- Until we have formal verification for everything, then you can do whatever you want, and you know for certain that you have not introduced a bug if the proofs pass.
- Well, concretely, what do you think that future would look like?
- I think people will just not write tests anymore. You write a function, the model will suggest a spec, and you review the spec. And in the meantime, a smart reasoning model computes a proof that the implementation follows the spec, and I think that happens for most functions.
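A toy version of that function-plus-spec-plus-proof workflow, written as a minimal sketch in Lean 4 (assuming a recent toolchain where the omega tactic is available); the function and its spec are made up for illustration:

```lean
-- The spec says clampNonNeg never returns a negative number;
-- the proof certifies that the implementation satisfies it.
def clampNonNeg (x : Int) : Int :=
  if x < 0 then 0 else x

theorem clampNonNeg_spec (x : Int) : 0 ≤ clampNonNeg x := by
  unfold clampNonNeg
  split <;> omega
```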
- Do you think this gets at a little bit of some of the stuff you were talking about earlier with the difficulty of specifying intent for what you want with software, where sometimes, because the intent is really hard to specify, it's also then going to be really hard to prove that it's actually matching whatever your intent is?
- You think that spec is hard to generate?
- Yeah, or just for a given spec. I think there is a question of, can you actually do the formal verification? Is that possible? I think that there's more to dig into there, but then also-
- Even if you have the spec?
- If you have the spec-
- Even if you have the spec, is the spec written in natural language? Or is it-
- No, the spec would be formal.
- But how easier would that be (indistinct).
- Okay, so then I think that you care about things that are not going to be easily well specified in the spec language.
- I see, I see, yeah, yeah.
- Which would maybe be an argument against formal verification is all you need.
- The worry is there's this massive document-
- Replacing something like unit tests, sure.
- Yeah, yeah. I think you can probably also evolve the spec languages to capture some of the things that they don't really capture right now. I don't know, I think it's very exciting.
- And you're speaking not just about single functions, you're speaking about entire code bases.
- I think entire code bases is harder, but that is what I would love to have, and I think it should be possible. There's a lot of work recently where you can formally verify down to the hardware. You formally verify the C code, and then you formally verify through the GCC compiler, and then through the Verilog down to the hardware. And that's an incredibly big system, but it actually works. And I think big code bases are sort of similar, in that they're a multi-layered system. And if you can decompose it and formally verify each part, then I think it should be possible. I think the specification problem is a real problem.
- How do you handle side effects, or how do you handle, I guess, external dependencies like calling the Stripe API?
- Maybe Stripe would write a spec for their API.
- But you can't do this for everything. Can you do this for everything you use? Maybe people will use language models as primitives in the programs they write, and there's a dependence on it, and how do you now include that?
- I think you might be able to prove that still.
- Prove what about language models?
- I think it feels possible that you could actually prove that a language model is aligned, for example, or you can prove that it actually gives the right answer.
- That's the dream.
- Yeah, I mean, if it's possible. That's your "I have a dream" speech. If it's possible, that will certainly help with making sure your code doesn't have bugs and making sure AI doesn't destroy all human civilization. So, the full spectrum of AI safety to just bug finding. So, you said the models struggle with bug finding. What's the hope?
- My hope initially is, and I can let Michael chime in too, but it was like it should first help with the stupid bugs. It should very quickly catch the stupid bugs, the off-by-one errors. Sometimes you write something in a comment and do it the other way. It's very common. I do this. I write less than in a comment and I maybe write the greater than or something like that. And the model is like, "Yeah, this looks sketchy. You sure you wanna do that?" But eventually, it should be able to catch harder bugs too.
- Yeah, and I think that it's also important to note that having good bug-finding models feels necessary to get to the highest reaches of having AI do more and more programming for you. If AI is building more and more of the system for you, you need to not just generate but also verify. And without that, some of the problems that we've talked about before with programming with these models will just become untenable. So it's not just for humans, like you write a bug, I write a bug, find the bug for me, but it's also being able to verify the AI's code and check it, which is really important.
- Yeah, and then how do you actually do this? We have had a lot of contentious dinner discussions of how do you actually train a bug model, but one very popular idea is it's potentially easier to introduce a bug than to actually find the bug. And so, you can train a model to introduce bugs in existing code, and then you can train a reverse bug model that can find bugs using this synthetic data. So that's one example, but there are lots of ideas for how to (indistinct).
- You can also do a bunch of work not even at the model level, of taking the biggest models and then maybe giving them access to a lot of information that's not just the code. It's a hard problem to stare at a file and be like, "Where's the bug?" And that's hard for humans often, right?
And so often, you have to run the code, and being able to see things like traces and step through it with a debugger, there's a whole other direction where it tends toward that.
- It could also be that there are two different product form factors here. It could be that you have a really specialty model that's quite fast and that's running in the background trying to spot bugs. And it might be that sometimes, to Arvid's earlier example about some nefarious input box bug, you know there's a bug, you're not just checking hypothesis-free, you're like, "This is a problem, I really wanna solve it," and you zap that with tons and tons and tons of compute, and you're willing to put in $50 to solve that bug, or something even more.
- Have you thought about integrating money into this whole thing? I would pay probably a large amount of money if you found a bug, or even generated code that I really appreciated. I had a moment a few days ago, when I started using Cursor, where it generated three perfect functions for interacting with the YouTube API to update captions for localization in different languages. The API documentation is not very good. I googled it for a while and couldn't find exactly what I needed, there's a lot of confusing information, and Cursor generated it perfectly. I just sat back, I read the code, I was like, "This is correct, I tested it, it's correct." I was like, "I wanna tip." I want a button that goes, "Here's $5." One, that's really good just to support the company and support what the interface is. And the other is that it probably sends a strong signal, like, good job. (all chuckling) So, there's this much stronger signal than just accepting the code, right? You just actually send a strong good job. And for bug finding, obviously, there's a lot of people that would pay a huge amount of money for a bug bounty thing, right? Do you guys think about that?
- Yeah, it's a controversial idea inside the company. I think it depends on how much you believe in humanity, almost. I think it would be really cool if you spend nothing to try to find a bug, and if it doesn't find a bug, you spend $0. And then if it does find a bug and you click accept, then it also shows in parentheses, like, $1. And so, you spend $1 to accept the bug. And then, of course, there's a worry like, "Okay, we spent a lot of computation, maybe people will just copy paste." I think that's a worry. Then there is also the worry that introducing money into the product makes it not feel as fun anymore. You have to think about money, and all you want to think about is the code. And so maybe it actually makes more sense to separate it out, and you pay some fee every month, and then you get all of these things for free.
- But there could be a tipping component, which is not like it costs this-
- Yes, but it still has that dollar symbol. I think it's fine, but I also see the point where maybe you don't want to introduce it.
- Yeah, I was gonna say, the moment that feels like people do this is when they share it. When they have this fantastic example, they just share it with their friends.
- There is also a potential world where there's a technical solution to this honor system problem too, where if we can get to a place where we understand the output of the system more, I mean, with the stuff we were talking about with error checking with the LSP, and then also running the code. If you could get to a place where you could actually somehow verify, "Oh, I have fixed the bug," maybe then the bounty system doesn't need to rely on the honor system too.
- How much interaction is there between the terminal and the code? How much information is gained if you run the code in the terminal? Can you do a loop where it runs the code and suggests how to change the code if the code at runtime gets an error? Right now, are they completely separate worlds? I know you can do Ctrl+K inside the terminal to help you write the code.
- You can use terminal context as well, inside of Command+K, kind of everything. We don't have the looping part yet, though we suspect something like this could make a lot of sense. There's a question of whether it happens in the foreground too, or if it happens in the background, like what we've been discussing.
- Sure, the background's pretty cool. I could be running the code in different ways. Plus there's a database side to this, which is how do you protect it from modifying the database, but okay. (group chuckling)
- I mean, there's certainly cool solutions there. There's this new API that is being developed. It's not in AWS, but it certainly, I think, it's in PlanetScale. I don't know if PlanetScale was the first one to add it. It's this ability to sort of add branches to a database, which is like, if you're working on a feature and you wanna test against the prod database, but you don't actually want to test against the prod database, you could add a branch to the database. And the way they do that is they add a branch to the write-ahead log. And there's obviously a lot of technical complexity in doing it correctly. I guess database companies need new things to do. (group chuckling) They have good databases now. And I think turbopuffer, which is one of the databases we use, is going to maybe add branching to the write-ahead log. So maybe the AI agents will use branching, they'll test against some branch, and it's gonna be a requirement for the database to support branching or something.
- It would be really interesting if you could branch a file system, right?
- Yeah. I feel like everything needs branching.
- [Aman] Yeah.
- Yeah. It's the problem with the multiverse, right? (group chuckling) If you branch on everything, that's like a lot.
- There's obviously these super clever algorithms to make sure that you don't actually use a lot of space or CPU or whatever.
- Okay, this is a good place to ask about infrastructure. So, you guys mostly use AWS. What are some interesting details? What are some interesting challenges? Why'd you choose AWS? Why is AWS still winning? Hashtag.
- AWS is just really, really good. It is really good. Whenever you use an AWS product, you just know that it's going to work. It might be absolute hell to go through the steps to set it up.
- Why is the interface so horrible?
- Because it's... (chuckles) It's just so good. It doesn't need to-
- It's the nature of winning. (group chuckling)
- I think it's exactly that, it's just the nature of winning.
- Yeah, yeah. But AWS we can always trust, it will always work. And if there is a problem, it's probably your problem. (Lex chuckles) Yeah.
- Okay, are there some interesting challenges? You guys are a pretty new startup, scaling to so many people.
- Yeah, I think that it has been an interesting journey adding each extra zero to the requests per second. (Lex chuckles) You run into all of these issues with the general components you're using for caching and databases as you make things bigger and bigger, and now we're at the scale where we get overflows on our tables and things like that. And then, also, there have been some custom systems that we've built, for instance, our retrieval system for computing a semantic index of your code base and answering questions about a code base, that have continually, I feel like, been one of the trickier things to scale.
- I have a few friends who are super senior engineers, and one of their lines is, it's very hard to predict where systems will break when you scale them. You can try to predict in advance, but there's always something weird that's gonna happen when you add these extra zeros. You thought you'd thought through everything, but you didn't actually think through everything. But I think for that particular system, we chunk up all of your code, and then we send up the code for embedding, and we embed the code. And then, we store the embeddings in a database, but we don't actually store any of the code. And then there's reasons around making sure that we don't introduce client bugs, because we're very, very paranoid about client bugs. We store much of the details on the server. Everything is encrypted. So, one of the technical challenges is always making sure that the local index, the local code base state, is the same as the state that is on the server. The way, technically, we ended up doing that is, for every single file you can keep this hash, and then for every folder you can keep a hash, which is the hash of all of its children. You can recursively do that until the top. Why do something complicated? Well, one thing you could do is you could keep a hash for every file, and every minute, you could try to download the hashes that are on the server, figure out what are the files that don't exist on the server. Maybe you just created a new file, maybe you just deleted a file, maybe you checked out a new branch, and try to reconcile the state between the client and the server. But that introduces absolutely ginormous network overhead, both on the client side. Nobody really wants us to hammer their WiFi all the time if you're using Cursor. But also, it would introduce ginormous overhead on the database. It would be reading this tens-of-terabytes database, approaching 20 terabytes or something of data, every second. That's just crazy. You definitely don't wanna do that. So what you do, you just try to reconcile the single hash, which is at the root of the project. And then if something mismatches, then you go and you find where all the things disagree. Maybe you look at the children and see if the hashes match. If the hashes don't match, go look at their children, and so on. But you only do that in the scenario where things don't match. For most people, most of the time, the hashes match.
- So it's like a hierarchical reconciliation of hashes.
- Yeah, something like that.
- Yeah, it's called a Merkle tree.
- Yeah, Merkle.
- Yeah.
- Yeah. This is cool to see that you have to think through all these problems.
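A minimal sketch of that hierarchical reconciliation, assuming a hypothetical remote_hashes lookup that returns the hashes the server currently has; this is illustrative, not Cursor's implementation:

```python
import hashlib
from pathlib import Path

def file_hash(path: Path) -> str:
    return hashlib.sha256(path.read_bytes()).hexdigest()

def tree_hash(directory: Path) -> str:
    """Hash of a folder = hash of its children's hashes (a Merkle tree),
    so comparing two roots says in one check whether anything changed."""
    parts = []
    for child in sorted(directory.iterdir()):
        child_hash = tree_hash(child) if child.is_dir() else file_hash(child)
        parts.append(f"{child.name}:{child_hash}")
    return hashlib.sha256("\n".join(parts).encode()).hexdigest()

def find_mismatches(local_root: Path, remote_hashes: dict) -> list:
    """Descend only where hashes disagree; matching subtrees are skipped."""
    mismatched = []

    def visit(node: Path) -> None:
        rel = str(node.relative_to(local_root))
        local = tree_hash(node) if node.is_dir() else file_hash(node)
        if remote_hashes.get(rel) == local:
            return  # this whole subtree already matches the server
        if node.is_dir():
            for child in sorted(node.iterdir()):
                visit(child)
        else:
            mismatched.append(node)

    visit(local_root)
    return mismatched
```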
- The reason it's gotten hard is just because of the number of people using it, and some of your customers have really, really large code bases. We originally built this for our own code base, which is big, but it's just not the size of some company that's been there for 20 years and has a ginormous number of files, and you wanna scale that across programmers. There's all these details where building the simple thing is easy, but scaling it to a lot of people and a lot of companies is obviously a difficult problem, which is independent of... so there's part of this scaling. Our current solution is also coming up with new ideas that, obviously, we're working on, but then scaling all of that in the last few weeks and months.
- Yeah. And there are a lot of clever things, additional things, that go into this indexing system. For example, the bottleneck in terms of costs is not storing things in the vector database or the database, it's actually embedding the code. You don't wanna re-embed the code base for every single person in a company that is using the same exact code, except for maybe they're on a different branch with a few different files, or they've made a few local changes. Because embeddings are the bottleneck, you can do this one clever trick and not have to worry about the complexity of dealing with branches and the other databases, where you just have a cache on the actual vectors computed from the hash of a given chunk.
- Mm-hmm.
- So this means that when the nth person at a company goes and embeds their code base, it's really, really fast. And you do all this without actually storing any code on our servers at all. No code data is stored. We just store the vectors in the vector database and the vector cache.
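A minimal sketch of that trick, caching vectors by the hash of each chunk so only the first person to index a given chunk pays for the embedding call; embed_fn stands in for whatever embedding model is used, and the in-memory dict stands in for a shared server-side cache:

```python
import hashlib

# chunk-hash -> embedding vector; a dict stands in for a shared server-side cache.
embedding_cache: dict = {}

def embed_chunk(chunk: str, embed_fn) -> list:
    """Only call the embedding model when this exact chunk has never been
    seen before; everyone else indexing the same code hits the cache."""
    key = hashlib.sha256(chunk.encode("utf-8")).hexdigest()
    if key not in embedding_cache:
        embedding_cache[key] = embed_fn(chunk)  # the only expensive call
    return embedding_cache[key]

def index_codebase(chunks: list, embed_fn) -> list:
    # Note that no source code needs to be stored next to the vectors:
    # the cache is keyed purely on content hashes.
    return [embed_chunk(chunk, embed_fn) for chunk in chunks]
```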
- What's the biggest gain, at this time, that you get from indexing the code base? Just out of curiosity, what benefit do users have? It seems like longer term, there'll be more and more benefit, but in the short term, just asking questions of the code base, what's the usefulness of that?
- I think the most obvious one is just, you want to find out where something is happening in your large code base, and you have a fuzzy memory of, "Okay, I want to find the place where we do X," but you don't exactly know what to search for in a normal text search. So you ask the chat, you hit Command+Enter to ask with the code base chat, and then very often it finds the right place that you were thinking of.
- Like you mentioned, in the future, I think this is only going to get more and more powerful, where we're working a lot on improving the quality of our retrieval. I think the ceiling for that is really, really much higher than people give it credit for.
- One question that's good to ask here: have you considered, and why haven't you done, much local stuff? It seems like everything we just discussed is exceptionally difficult to do. To go to the cloud, you have to think about all these things with the caching and the large code bases where a large number of programmers are using the same code base. You have to figure out the puzzle of that. A lot of it, most software, just does this heavy computational stuff locally. So, have you considered doing embeddings locally?
- Yeah, we thought about it, and I think it would be cool to do it locally. I think it's just really hard. One thing to keep in mind is that some of our users use the latest MacBook Pro, but most of our users, more than 80% of our users, are on Windows machines, and many of them are not very powerful. So, local models really only work on the latest computers, and it's also a big overhead to build that in. So even if we would like to do that, it's currently not something that we are able to focus on. I think there are some people that do that, and I think that's great, but especially as models get bigger and bigger and you want to do fancier things with bigger models, it becomes even harder to do it locally.
- Yeah, it's not a problem of weaker computers. It's just that, for example, if you're some big company, you have a big company code base. It's just really hard to process a big company code base even on the beefiest MacBook Pros. It's not even a matter of if you're just a student or something. I think if you're the best programmer at a big company, you're still gonna have a horrible experience. You could do it locally and scrape by, but again, it wouldn't be fun anymore.
- Yeah, like approximate nearest neighbors on this massive code base is gonna just eat up your memory and your CPU, and that's just that. Let's talk about also the modeling side, where, as Arvid said, there are these massive headwinds against local models. One, things seem to be moving towards MoEs, where one benefit is maybe they're more memory-bandwidth bound, which plays in favor of local versus using Nvidia GPUs. But the downside is these models are just bigger in total, and they're often not even gonna fit on a single node, but need multiple nodes. There's no way that's gonna fit inside of even really good MacBooks. And I think, especially for coding, it's not a question as much of, does it clear some bar of the model being good enough to do these things and then we're satisfied, which may be the case for other problems and maybe where local models shine, but people are always gonna want the best, the most intelligent, the most capable things, and that's gonna be really, really hard to run for almost all people locally.
- Don't you want the most capable model? You want Sonnet too?
- And also o1- (Lex chuckling)
- I like how you're pitching me. (group chuckling)
- O1 is another-
- Would you be satisfied with an inferior model? Listen, yes, I'm one of those, but there's some people that like to do stuff locally, really. There's a whole, obviously, open source movement that resists.
It's good that they exist, actually, because you wanna resist the power centers that are growing our-
- There's actually an alternative to local models that I am particularly fond of. I think it's still very much in the research stage, but you could imagine doing homomorphic encryption for language model inference. So you encrypt your input on your local machine, then you send that up, and then the server can use lots of computation. They can run models that you cannot run locally on this encrypted data, but they cannot see what the data is. And then they send back the answer, and you decrypt the answer, and only you can see the answer. So I think that's still very much research, and all of it is about trying to make the overhead lower, because right now the overhead is really big. But if you can make that happen, I think that would be really, really cool, and I think it would be really, really impactful, because I think one thing that's actually worrisome is that, as these models get better and better, they're going to become more and more economically useful. And so, more and more of the world's information and data will flow through one or two centralized actors. And then there are worries about, there can be traditional hacker attempts, but it also creates this scary part where, if all of the world's information is flowing through one node in plaintext, you can have surveillance in very bad ways. Initially, it will be for good reasons. People will want to try to protect against bad actors using AI models in bad ways, and then you will add in some surveillance code. And then someone else will come in, and you're on a slippery slope, and then you start doing bad things with a lot of the world's data. So, I am very hopeful that we can solve homomorphic encryption for language model inference.
- Yeah, and doing privacy-preserving machine learning. But I would say that's the challenge we have with all software these days. There's so many features that can be provided from the cloud, and all of us increasingly rely on it, and it makes our life awesome. But there are downsides, and that's why you rely on really good security to protect from basic attacks. But there's also only a small set of companies that are controlling that data, and they obviously have leverage, and they could be infiltrated in all kinds of ways. That's the world we live in.
- Yeah, the thing I'm just actually quite worried about is, Anthropic has this responsible scaling policy where we're at the low ASLs, which is the Anthropic security level or whatever, of the models. But as we get to, quote, unquote, "ASL-3, ASL-4," whatever models, which are very powerful, for mostly reasonable security reasons you would wanna monitor all the prompts. I think that's reasonable and understandable where everyone is coming from. But man, it'd be really horrible if all the world's information is monitored that heavily. It's way too centralized. It's like this really fine line you're walking, where on the one side, you don't want the models to go rogue. On the other side, humans... I don't know if I trust all the world's information to pass through three model providers.
- Yeah.
- Why do you think it's different than cloud providers?
- Because I think a lot of this data would never have gone to the cloud providers in the first place. You want to give more data to the AI models; you want to give personal data that you would never have put online in the first place to these companies or to these models. It also centralizes control, where right now, for cloud, you can often use your own encryption keys, and AWS can't really do much. But here, it's just centralized actors that see the exact plaintext of everything.
- On the topic of context, that's actually been a friction for me. When I'm writing code in Python, there's a bunch of stuff imported. You could probably intuit the kind of stuff I would like to include in the context. How hard is it to automatically figure out the context?
- It's tricky. I think we can do a lot better at computing the context automatically in the future. One thing that's important to note is, there are trade-offs with including automatic context. So, the more context you include for these models, first of all, the slower they are and the more expensive those requests are, which means you can then do fewer model calls and do less fancy stuff in the background. Also, for a lot of these models, they get confused if you have a lot of information in the prompt. So the bar for accuracy and for relevance of the context you include should be quite high. Already, we do some automatic context in some places within the product. It's definitely something we wanna get a lot better at. I think that there are a lot of cool ideas to try there, both on learning better retrieval systems, like better embedding models, better rerankers, and I think that there are also cool academic ideas, stuff we've tried out internally, but also what the field is grappling with writ large: can you get language models to a place where you can actually just have the model itself understand a new corpus of information? The most popular talked-about version of this is, can you make the context windows infinite? Then, if you make the context windows infinite, can you make the model actually pay attention to the infinite context? And then, after you can make it pay attention to the infinite context, to make it somewhat feasible to actually do it, can you then do caching for that infinite context so you don't have to recompute it all the time? But there are other cool ideas that are being tried, that are a little bit more analogous to fine-tuning, of actually learning this information in the weights of the model. And it might be that you actually get a qualitatively different type of understanding if you do it more at the weight level than if you do it at the in-context learning level. I think the jury's still a little bit out on how this is all gonna work in the end. But in the interim, us as a company, we are really excited about better retrieval systems and picking the parts of the code base that are most relevant to what you're doing, and we could do that a lot better.
- One interesting proof of concept for learning this knowledge directly in the weights is with VS Code. So, we're in a VS Code fork, and VS Code's code is all public. So these models, in pre-training, have seen all the code.
They've probably also seen questions and answers about it. And then, they've been fine-tuned and RLHFed to be able to answer questions about code in general. So when you ask it a question about VS Code, sometimes it'll hallucinate, but sometimes it actually does a pretty good job at answering the question. It happens to be okay, but what if you could actually specifically train or post-train a model such that it really was built to understand this code base? It's an open research question, one that we're quite interested in. And then there's also uncertainty of, do you want the model to be the thing that end-to-end is doing everything, i.e., it's doing the retrieval in its internals and then answering a question, creating the code, or do you want to separate the retrieval from the frontier model, where maybe you'll get some really capable models that are much better than the best open source ones in a handful of months? And then you'll want to separately train a really good open source model to be the retriever, to be the thing that feeds in the context to these larger models.
- Can you speak a little more to post-training a model to understand the code base? What do you mean by that? Is this a synthetic data direction? Is this-
- Yeah, there are many possible ways you could try doing it. There's certainly no shortage of ideas. It's just a question of going in and trying all of them and being empirical about which one works best. One very naive thing is to try to replicate what's done with VS Code and these frontier models. So, let's continue pre-training, some kind of continued pre-training that includes general code data but also throws in some of the data of the particular repository that you care about. And then in post-training, meaning, let's just start with instruction fine-tuning, you have a normal instruction fine-tuning data set about code, then you throw in a lot of questions about code in that repository. So, you could either get ground truth ones, which might be difficult, or you could do what you hinted at or suggested, using synthetic data, i.e., having the model ask questions about various recent pieces of the code. So you take the pieces of the code, then prompt the model or have a model propose a question for that piece of code, and then add those as instruction fine-tuning data points. And then in theory, this might unlock the model's ability to answer questions about that code base.
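A minimal sketch of that synthetic-data idea, where ask_model is a hypothetical stand-in for whatever completion API is used; the point is just the shape of the generated instruction-tuning pairs:

```python
def build_repo_qa_dataset(code_chunks: list, ask_model) -> list:
    """Turn raw repository chunks into (question, answer) instruction
    fine-tuning examples proposed by the model itself."""
    dataset = []
    for chunk in code_chunks:
        question = ask_model(
            "Propose one question a developer might ask about this code:\n\n" + chunk
        )
        answer = ask_model(
            "Answer the question using only this code.\n\n"
            f"Code:\n{chunk}\n\nQuestion: {question}"
        )
        dataset.append({"instruction": question, "response": answer})
    return dataset
```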
- Let me ask you about OpenAI o1. What do you think is the role of that kind of test time compute system in programming?
- I think test time compute is really, really interesting. So, there's been the pre-training regime, where, as you scale up the amount of data and the size of your model, you get better and better performance, both on loss and then on downstream benchmarks and just general performance, whether we use it for coding or other tasks. We're starting to hit a bit of a data wall, meaning it's going to be hard to continue scaling up this regime. So, scaling up test time compute is an interesting way of now increasing the number of inference time flops. As you increase the number of flops you use at inference time, you get corresponding improvements in the performance of these models. Traditionally, we just had to literally train a bigger model that always used that many more flops, but now we could perhaps use the same-size model and run it for longer to be able to get an answer at the quality of a much larger model. And so, the really interesting thing I like about this is, there are some problems that perhaps require 100 trillion parameter model intelligence trained on 100 trillion tokens, but that's maybe 1%, maybe .1%, of all queries. So are you going to spend all of this effort, all of this compute, training a model that costs that much and then run it so infrequently? Instead, you train the model that is capable of doing the 99.9% of queries, and then you have a way of running it longer at inference time for those few people that really, really want max intelligence.
- How do you figure out which problem requires what level of intelligence? Is it possible to dynamically figure out when to use GPT-4, when to use a small model, and when you need the o1? (group chuckles)
- Yeah, that's an open research problem, certainly. I don't think anyone's actually cracked this model routing problem quite well. We have initial implementations of this for something like Cursor Tab, but at the level of going between 4o, Sonnet, and o1, it's a bit trickier. There's also a question of, what level of intelligence do you need to determine if the thing is too hard for the 4o-level model? Maybe you need the o1-level model. It's really unclear.
- But you mentioned this. So, there's a pre-training process, then there's post-training, and then there's test time compute. Is that fair to separate? Where are the biggest gains?
- Well, it's weird, because for test time compute, there's a whole training strategy needed to get test time compute to work. The other really weird thing about this is, outside of the big labs, and maybe even just OpenAI, no one really knows how it works. There've been some really interesting papers that show hints of what they might be doing. So, perhaps they're doing something with tree search using process reward models. But yeah, I think the issue is we don't quite know exactly what it looks like, so it would be hard to comment on where it fits in. I would put it in post-training, but maybe the compute spent for this, for getting test time compute to work for a model, is going to dwarf pre-training eventually.
- So we don't even know if o1 is using just chain of thought? We don't know how they're using any of these? We don't know anything?
- It's fun to speculate. (group chuckling)
- If you were to build a competing model, what would you do?
- Yeah, so one thing to do would be, I think you probably need to train a process reward model. So maybe we can get into reward models, and outcome reward models versus process reward models. Outcome reward models are the traditional reward models that people train for language modeling, and it's just looking at the final thing. So if you're doing some math problem, let's look at that final thing. You've done everything, and let's assign a grade to it, how likely we think... what's the reward for this outcome? Process reward models instead try to grade the chain of thought.
And so OpenAI had a preliminary paper on this, I think, last summer, where they used human labelers to get this pretty large, several-hundred-thousand data set of creating chains of thought. Ultimately, it feels like I haven't seen anything interesting in the ways that people use process reward models outside of just using them as a means of affecting how we choose between a bunch of samples. So, what people do in all these papers is they sample a bunch of outputs from the language model, and then use the process reward models to grade all those generations, alongside maybe some other heuristics, and then use that to choose the best answer. The really interesting thing that people think might work, and people want to work, is tree search with these process reward models. Because if you really can grade every single step of the chain of thought, then you can branch out and explore multiple paths of this chain of thought, and then use these process reward models to evaluate how good is this branch that you're taking.
- Yeah, when the quality of the branch is somehow strongly correlated with the quality of the outcome at the very end, so you have a good model of knowing which branch to take. So not just in the short term, in the long term?
- Yeah. The interesting work that I think has been done is figuring out how to properly train the process reward models, or the interesting work that has been open sourced and people I think talk about is how to train the process reward models, maybe in a more automated way. I could be wrong here, I could not be mentioning some papers, but I haven't seen anything super that seems to work really well for using the process reward models creatively to do tree search and code.
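A minimal sketch of the best-of-n usage just described, with generate_cot and prm_score_step as hypothetical stand-ins for a sampler and a trained process reward model; real tree search would branch at each step rather than scoring whole samples:

```python
def best_of_n(problem: str, generate_cot, prm_score_step, n: int = 8) -> str:
    """Sample n chains of thought, score every step with the PRM, and keep
    the sample whose weakest step scores best (one common aggregation)."""
    best_answer, best_score = None, float("-inf")
    for _ in range(n):
        steps = generate_cot(problem)  # list of reasoning steps; the last is the answer
        if not steps:
            continue
        step_scores = [prm_score_step(problem, steps[: i + 1])
                       for i in range(len(steps))]
        score = min(step_scores)
        if score > best_score:
            best_score, best_answer = score, steps[-1]
    return best_answer
```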
- This is an AI safety, maybe a bit of a philosophy, question. So OpenAI says that they're hiding the chain of thought from the user, and they've said that that was a difficult decision to make. Instead of showing the chain of thought, they're asking the model to summarize the chain of thought. They're also in the background saying they're going to monitor the chain of thought to make sure the model is not trying to manipulate the user, which is a fascinating possibility. But anyway, what do you think about hiding the chain of thought?
- One consideration for OpenAI, and this is completely speculative, could be that they wanna make it hard for people to distill these capabilities out of their model. It might actually be easier, if you had access to that hidden chain of thought, to replicate the technology, because that's pretty important data, like seeing the steps that the model took to get to the final result.
- So, you could probably train on that also?
- And there was a mirrored situation with this with some of the large language model providers, and also this is speculation, but some of these APIs used to offer easy access to log probabilities for all the tokens that they're generating, and also log probabilities over the prompt tokens. And then some of these APIs took those away. Again, complete speculation, but one of the thoughts is that the reason those were taken away is, if you have access to log probabilities, similar to this hidden chain of thought, that can give you even more information to try and distill these capabilities out of the APIs, out of these biggest models, and into models you control. As an asterisk also on the previous discussion about us integrating o1, I think that we're still learning how to use this model. So, we made o1 available in Cursor because when we got the model, we were really interested in trying it out. I think a lot of programmers are gonna be interested in trying it out. But o1 is not part of the default Cursor experience in any way, and we still haven't found a way to integrate it into the editor in a way that we reach for every hour, maybe even every day. So, I think the jury's still out on how to use the model, and we haven't seen examples yet of people releasing things where it seems really clear, like, "Oh, that's now the use case." The obvious one to turn to is maybe this can make it easier for you to have these background things running, to have these models in loops, to have these models be agentic. But we're still discovering.
- To be clear, we have ideas. We just need to try and get something incredibly useful before we put it out there.
- But it has these significant limitations. Even barring capabilities, it does not stream, and that means it's really, really painful to use for things where you want to supervise the output. Instead, you're just waiting for the wall of text to show up. Also, it does feel like the early innings of test time compute and search, where it's just very, very much a v0, and there's so many things that don't feel quite right. And I suspect, in parallel to people increasing the amount of pre-training data and the size of the models in pre-training and finding tricks there, you'll now have this other thread of getting search to work better and better.
- So, let me ask you about strawberry tomorrow eyes. (group chuckles) So, it looks like GitHub Copilot might be integrating o1 in some kind of way, and I think some of the comments are saying, does this mean Cursor is done? (group chuckles) I think I saw one comment saying that.
- It's time to shut down Cursor, yeah.
- Time to shut down Cursor, thank you. (group chuckling) So, is it time to shut down Cursor?
- I think this space is a little bit different from past software spaces over the 2010s, where I think that the ceiling here is really, really, really incredibly high. So, I think that the best product in three to four years will just be so much more useful than the best product today. You can wax poetic about moats this and brand that and this is our advantage, but I think in the end, if you just stop innovating on the product, you will lose. That's also great for startups, that's great for people trying to enter this market, because it means you have an opportunity to win against people who have lots of users already, by just building something better. And so, I think over the next few years, it's just about building the best product, building the best system, and that both comes down to the modeling engine side of things, and it also comes down to the editing experience.
- Yeah, I think most of the additional value from Cursor versus everything else out there is not just integrating the new model fast, like o1.
It comes from all of the depth that goes into these custom models that you don't realize are working for you in every facet of the product, as well as the really thoughtful UX with every single feature.
- All right, from that profound answer, let's descend back down to the technical. You mentioned you have a taxonomy of synthetic data.
- (chuckles) Oh, yeah.
- Can you please explain?
- Yeah, I think there are three main kinds of synthetic data. So what is synthetic data, first? There's normal data, like non-synthetic data, which is just data that's naturally created, i.e., usually it'll be from humans having done things. So, from some human process you get this data. Synthetic data, the first kind would be distillation. So, having a language model output tokens or probability distributions over tokens, you can then train some less capable model on this. This approach is not gonna get you a more capable model than the original one that produced the tokens, but it's really useful if there's some capability you wanna elicit from some really expensive, high-latency model. You can then distill that down into some smaller, task-specific model. The second kind is when one direction of the problem is easier than the reverse. So, a great example of this is bug detection, like we mentioned earlier, where it's a lot easier to introduce reasonable-looking bugs than it is to actually detect them. And this is probably the case for humans too. And so what you can do is you can get a model that's not trained on that much data, that's not that smart, to introduce a bunch of bugs into code, and then you can use that synthetic data to train a model that can be really good at detecting bugs. The last category, I think, is, I guess, the main one that it feels like the big labs are doing for synthetic data, which is producing text with language models that can then be verified easily. So, an extreme example of this is, if you have a verification system that can detect if language is Shakespeare-level, and then you have a bunch of monkeys typing on typewriters, you can eventually get enough training data to train a Shakespeare-level language model. And I mean, this is very much the case for math, where verification is actually really, really easy for formal languages. And then what you can do is you can have an okay model generate a ton of rollouts, and then choose the ones that you know have actually proved the ground truth theorems, and train on those further. There are similar things you can do for code with LeetCode-like problems, where if you have some set of tests that you know correspond to solving the problem when something passes them, you could do the same thing, where you verify that it's passed the tests and then train the model on the outputs that have passed the tests. I think it's gonna be a little tricky getting this to work in all domains, or just in general. Having the perfect verifier feels really, really hard to do for just open-ended, miscellaneous tasks you give the model, or more long-horizon tasks, even in coding.
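As a toy sketch of that second kind of synthetic data (a real pipeline would use a model to introduce subtler bugs; this just flips comparison operators to show the shape of the training pairs):

```python
import random

MUTATIONS = [("<=", "<"), (">=", ">"), ("==", "!=")]

def introduce_bug(clean_code: str) -> str:
    """The easy direction: mechanically corrupt working code."""
    applicable = [m for m in MUTATIONS if m[0] in clean_code]
    old, new = random.choice(applicable or MUTATIONS)
    return clean_code.replace(old, new, 1)

def make_training_pairs(snippets: list) -> list:
    # The (buggy, fixed) pairs become supervision for the hard direction:
    # a model that detects and repairs the bug.
    return [{"buggy": introduce_bug(code), "fixed": code} for code in snippets]
```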
- [Lex] That's 'cause you're not as optimistic as Arvid. But yeah, so yeah, (Aman chuckles) that third category requires having a verifier.
- Yeah. Verification, it feels like, is best when you know for a fact that it's correct. And then it wouldn't be like using a language model to verify, it would be using tests or formal systems.
- Or running the thing too. Doing the human form of verification, where you just do manual quality control.
- Yeah.
- Yeah.
- But the language model version of that, where it's running the thing and it actually understands the output.
- Yeah, no, that's-
- I'm sure it's somewhere in between.
- Yeah. I think that's the category that is most likely to result in massive gains.
- What about RL with feedback? RLHF versus RLAIF, what's the role of that in getting better performance on the models?
- Yeah. So, RLHF is when the reward model you use is trained from some labels you've collected from humans giving feedback. I think this works if you have the ability to get a ton of human feedback for the kind of task that you care about. RLAIF is interesting, because it depends on the constraint that verification is actually a decent bit easier than generation. Because it feels like, okay, what are you doing? Are you using this language model to look at the language model outputs and then improve the language model? But no, it actually may work, if the language model has a much easier time verifying some solution than it does generating it, then you actually could perhaps get this recursive loop. But I don't think it's gonna look exactly like that. The other thing you could do, that we kind of do, is a little bit of a mix of RLAIF and RLHF, where usually the model is actually quite correct, and this is the case for Cursor Tab, picking between two possible generations of what is the better one. And then it just needs a little bit of human nudging, with only on the order of 50 to 100 examples, to align that prior the model has with exactly what you want. It looks different than normal RLHF, where you're usually training these reward models on tons of examples.
- What's your intuition when you compare generation and verification, or generation and ranking? Is ranking way easier than generation?
- My intuition would just say, yeah, it should be. Like, if you believe P does not equal NP, then there's this massive class of problems that are much, much easier to verify given a proof than to actually prove.
- I wonder if the same thing will prove P not equal to NP, or P equal to NP.
- (chuckles) That would be really cool.
- That'd be a, whatever, Fields Medal by AI. (group giggling) Who gets the credit? Another open philosophical question. (group chuckling)
- Whoever prompted it. (group chuckling)
- I'm actually surprisingly curious what a good bet for when AI will get the Fields Medal will be. I actually don't have-
- Isn't this Aman's specialty?
- I don't know what Aman's bet here is.
- Oh, sorry, Nobel Prize or Fields Medal first?
- Fields Medal-
- Oh, Fields Medal level?
- Fields Medal comes first, I think.
- Fields Medal comes first. Well, you would say that, of course. (group chuckling)
- But it's also this isolated system you can verify.
- Sure.
- Yeah.
- I don't even know if I-
- You don't need to do (indistinct).
- I feel like I have much more to do there.
- What's your intuition when you compare generation and verification, or generation and ranking? Is ranking way easier than generation?
- My intuition would just say, yeah, it should be. Like, if you believe P does not equal NP, then there's this massive class of problems that are much, much easier to verify given a proof than to actually prove.
- I wonder if the same thing will prove P not equal to NP, or P equal to NP.
- (chuckles) That would be really cool.
- That'd be a Fields Medal won by AI. (group giggling) Who gets the credit? Another open philosophical question. (group chuckling)
- Whoever prompted it. (group chuckling)
- I'm actually surprisingly curious what a good bet for when an AI will get the Fields Medal will be. I actually don't have-
- Isn't this Aman's specialty?
- I don't know what Aman's bet here is.
- Oh, sorry, Nobel Prize or Fields Medal first?
- Fields Medal-
- Oh, Fields Medal level?
- Fields Medal comes first, I think.
- Fields Medal comes first. Well, you would say that, of course. (group chuckling)
- But it's also this isolated system you can verify.
- Sure.
- Yeah.
- I don't even know if I-
- You don't need to do (indistinct).
- I feel like I have much more to do there. It felt like the path to get to the IMO was a little bit more clear, because it already could get a few IMO problems, and there was a bunch of low-hanging fruit, given the literature at the time, of what tactics people could take. I think I'm, one, much less versed in the space of theorem proving now, and two, have less intuition about how close we are to solving these really, really hard open problems.
- So you think it'll be the Fields Medal first? It won't be in physics or-
- Oh, 100%, I think that's probably more likely. It's probably much more likely that it'll get the Fields Medal first. Yeah, yeah, yeah. Well, I think, going to, I don't know, BSD, which is the Birch and Swinnerton-Dyer conjecture, or (indistinct), or any one of these hard math problems, they're just actually really hard. It's unclear what the path to get even a solution looks like. We don't even know what a path looks like, let alone (indistinct).
- And you don't buy the idea that this is just an isolated system, and you can actually have a good reward system, and it feels like it's easier to train for that?
- I think we might get the Fields Medal before AGI.
- I mean, I'd be very happy. I'd be very happy. But I don't know if I think 2028, 2030. (Aman chuckles)
- For the Fields Medal?
- Fields Medal.
- All right. It feels like forever from now, given how fast things have been going. Speaking of how fast things have been going, let's talk about scaling laws. So, for people who don't know, maybe it's good to talk about this whole idea of scaling laws. What are they, where do you think we stand, and where do you think things are going?
- I think it was interesting. The original scaling laws paper by OpenAI was slightly wrong, because I think they had some issues with learning rate schedules. And then Chinchilla showed a more correct version. And then from there, people have again deviated from doing the compute-optimal thing, because people now start optimizing more for making the thing work really well given an inference budget. And I think there are a lot more dimensions to these curves than what we originally used, of just compute, number of parameters, and data. Inference compute is the obvious one. I think context length is another obvious one. Let's say you care about those two things, inference compute and context window: maybe the thing you wanna train is some kind of SSM, because they're much, much cheaper and faster at super, super long context. And even if, maybe, it has 10X worse scaling properties during training, meaning you spend 10X more compute to train the thing to get the same level of capability, it's worth it, because you care most about that inference budget for really long context windows. So it'll be interesting to see how people play with all these dimensions.
- So, yeah, I mean, you speak to the multiple dimensions, obviously. The original conception was just looking at the variables of the size of the model as measured by parameters, and the size of the data as measured by the number of tokens, and looking at the ratio of the two.
- Yeah.
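For reference, a rough back-of-the-envelope version of that parameters-to-tokens ratio, using the commonly cited Chinchilla approximations: about 20 training tokens per parameter, with training compute approximated as C ≈ 6·N·D. The constants are rules of thumb for illustration, not figures from the conversation.

```python
def chinchilla_optimal(compute_flops: float, tokens_per_param: float = 20.0):
    """Roughly compute-optimal parameter count N and token count D for a budget C ~ 6 * N * D."""
    # With D = k * N and C = 6 * k * N^2, the optimal N is sqrt(C / (6 * k)).
    n_params = (compute_flops / (6.0 * tokens_per_param)) ** 0.5
    n_tokens = tokens_per_param * n_params
    return n_params, n_tokens

# Example: a 1e24 FLOP budget lands near a ~90B-parameter model trained on ~1.8T tokens.
n, d = chinchilla_optimal(1e24)
print(f"params ~ {n:.2e}, tokens ~ {d:.2e}")
```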
- And it's kind of a compelling notion that there is a number, or at least a minimum, and it seems like one was emerging. Do you still believe that there is a kind of bigger-is-better?
- I mean, I think bigger is certainly better for just raw performance.
- And raw intelligence.
- And raw intelligence. In terms of the path people might take, though, I'm particularly bullish on distillation. How many knobs can you turn such that, if we spend a ton of money on training, we get the most capable cheap model? Really, really caring as much as you can. 'Cause the naive version of caring as much as you can about inference-time compute is what people have already done with the Llama models, of just over-training the shit out of 7B models on way, way, way more tokens than is compute-optimal. But if you really care about it, maybe the thing to do is what Gemma did, which is: let's not just train on tokens, let's literally train on minimizing the KL divergence with the distribution of Gemma 27B, right? So, knowledge distillation there. And you're spending the compute of literally training this 27-billion-parameter model on all these tokens, just to get out this, I don't know, smaller model.
- And the distillation gives you just a faster model; smaller means faster.
- Yeah, distillation, in theory, is, I think, getting out more signal from the data that you're training on. And it's perhaps another way of getting over, not completely over, but partially helping with the data wall, where you only have so much data to train on. Let's train this really, really big model on all these tokens, and we'll distill it into this smaller one. And maybe we can get more signal per token for this much smaller model than we would have originally if we trained it.
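A minimal sketch of that distillation objective: instead of training the student on sampled tokens, you train it to match the teacher's full next-token distribution by minimizing KL divergence. This is plain Python over a toy four-token vocabulary; the probabilities are invented for illustration and are not from any real model.

```python
import math

def kl_divergence(teacher_probs, student_probs):
    """KL(teacher || student) for one next-token distribution, the per-position distillation loss."""
    return sum(p * (math.log(p) - math.log(q))
               for p, q in zip(teacher_probs, student_probs) if p > 0)

# Toy next-token distributions over a 4-token vocabulary.
teacher = [0.70, 0.20, 0.05, 0.05]  # what a big teacher (a Gemma-27B-scale model, say) predicts
student = [0.40, 0.30, 0.20, 0.10]  # what the smaller student currently predicts
print(f"distillation loss at this position: {kl_divergence(teacher, student):.4f}")
```

The idea in the conversation is that this squeezes more signal out of each training token than a one-hot next-token target does.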
- So if I gave you $10 trillion, how would you spend it? (Aman chuckles) I mean, you can't buy an island or whatever. How would you allocate it in terms of improving the big model versus maybe paying for the HF in RLHF?
- Yeah, yeah. I think there are a lot of secrets and details about training these large models that I just don't know, that only the large labs are privy to. And the issue is, I would waste a lot of that money if I even attempted this, because I wouldn't know those things. Suspending a lot of disbelief and assuming you had the know-how, or are you saying you have to operate with the limited information you have now?
- No, no, no, actually, I would say, you swoop in and you get all the information, all the little heuristics, all the little parameters, all the parameters that define how the thing is trained.
- Mm-hmm.
- If we look at how to invest the money for the next five years in terms of maximizing what you called raw intelligence.
- I mean, isn't the answer really simple? You just try to get as much compute as possible. At the end of the day, all you need to buy is the GPUs. You can tune whether you want to pre-train a big model or a small model.
- Well, this gets into the question of: are you really limited by compute and money, or are you limited by these other things?
- I'm more partial to Arvid's belief that we're idea-limited, but there's always that like-
- But if you have a lot of compute, you can run a lot of experiments.
- So you would run a lot of experiments versus use that compute to train a gigantic model?
- I would, but I do believe that we are limited in terms of the ideas that we have.
- Yeah, I think, 'cause even with all this compute, and all the data you could collect in the world, I think you really are ultimately limited by not even ideas, but just really good engineering. There aren't that many people in the world who really can make the difference here. And there's so much work that goes into research that is just pure, really, really hard engineering work. As a very hand-wavy example, if you look at the original Transformer paper, how much work was joining together a lot of these really interesting concepts embedded in the literature, versus then going in and writing all the code, maybe the CUDA kernels, maybe whatever else, I don't know if it originally ran on GPUs or TPUs, such that it actually saturated the GPU performance. Getting Noam Shazeer to go in and do all of this, and Noam is probably one of the best engineers in the world. Or maybe going a step further, with the next generation of models: getting model parallelism to work and scaling it on thousands of, or maybe tens of thousands of, V100s, which I think GPT-3 may have been. There's just so much engineering effort that has to go into all of these things to make them work. If you really brought that cost down to, maybe not zero, but just made it 10X easier, made it super easy for someone with really fantastic ideas to immediately get to the version of the new architecture they dreamed up, one that gets 40, 50% utilization on their GPUs, I think that would just speed up research by a ton.
- I mean, I think if you see a clear path to improvement, you should always take the low-hanging fruit first, right? I think OpenAI and all the other labs probably did the right thing picking off the low-hanging fruit. Where the low-hanging fruit is like, you could scale up to GPT-4.25 scale and you just keep scaling, and things keep getting better. There's no point in experimenting with new ideas when everything is working, and you should keep banging on it and try to get as much juice out of it as possible. I think if you're spending $10 trillion, you probably wanna spend some of it, and then actually reevaluate your ideas a little bit at that point.
- I think all of us believe new ideas are probably needed to get all the way to AGI. And all of us also probably believe there exist ways of testing out those ideas at smaller scales and being fairly confident that they'll play out. It's just quite difficult for the labs, in their current position, to dedicate their very limited research and engineering talent to exploring all these other ideas, when there's this core thing that will probably improve performance for some decent amount of time.
- Yeah, but also, these big labs like winning. (Lex chuckles) So, they're just going wild. Okay. (all chuckling) So, big question, looking out into the future. You're now at the center of the programming world. How do you think programming, the nature of programming, changes in the next few months, in the next year, in the next two years, the next five years, 10 years?
- I think we're really excited about a future where the programmer is in the driver's seat for a long time. And you've heard us talk about this a little bit, but one that emphasizes speed and agency for the programmer, and control: the ability to modify anything you wanna modify, the ability to iterate really fast on what you're building. And this is a little different, I think, than where some people are jumping to in the space, where I think one idea that's captivated people is: can you talk to your computer? Can you have it build software for you, as if you're talking to an engineering department or an engineer over Slack? Can it just be this sort of isolated text box? And part of the reason we're not excited about that is some of the stuff we've talked about with latency, but then a big piece, a reason we're not excited about that, is because that comes with giving up a lot of control. It's much harder to be really specific when you're talking in a text box. And if you're necessarily just going to communicate with the thing like you would communicate with an engineering department, you're actually abdicating tons of really important decisions to this bot. And this kind of gets at, fundamentally, what engineering is. I think that some people who are a little bit more removed from engineering might think of it as: the spec is completely written out, and then the engineers just come and implement. It's just about making the thing happen in code and making the thing exist. But I think a lot of the best engineering, the engineering we enjoy, involves tons of tiny micro-decisions about what exactly you're building, and about really hard trade-offs between speed and cost and just all the other things involved in a system. As long as humans are actually the ones designing the software and the ones specifying what they want to be built, and it's not just a company run by all AIs, we think you'll really want the human in the driver's seat dictating these decisions. And so the jury's still out on what that looks like. I think that one weird idea for what that could look like is that you can control the level of abstraction you view a code base at. You can point at specific parts of a code base; maybe you digest a code base by looking at it in the form of pseudocode, and you can actually edit that pseudocode too, and then have changes get made down at the formal programming level. And you can gesture at any piece of logic in your software. You keep the in-flow text editing component of programming, you keep the control; you can even go down into the code, you can go to higher levels of abstraction, while also getting these big productivity gains.
- It'd be nice if you can go up and down the abstraction stack.
- Yeah. And there are a lot of details to figure out there; it's sort of a fuzzy idea. Time will tell if it actually works. But these principles of control and speed, and the human in the driver's seat, we think are really important. We think, as Arvid mentioned before, for some styles of programming you can hand it off chatbot-style, if you have a bug that's really well specified. But that's not most of programming, and that's also not most of the programming we think a lot of people value.
- What about the fundamental skill of programming? There are a lot of people, like young people right now, who are scared, 'cause they love programming, but they're scared about, "Will I be able to have a future if I pursue this career path?" Do you think the very skill of programming will change fundamentally?
- I actually think this is a really, really exciting time to be building software. We remember what programming was like in 2013, 2012, whatever it was, and there was just so much more cruft and boilerplate and looking up something really gnarly. And that stuff still exists, it's definitely not at zero, but programming today is way more fun than back then. It's like we're really getting down to the delight concentration. All the things that really draw people to programming, for instance this element of being able to build things really fast, and speed, and also individual control, all those are just being turned up a ton. And so I think it's gonna be a really, really fun time for people who build software. I think that the skills will probably change too. I think that people's taste and creative ideas will be magnified, and it will be maybe a little bit less about boilerplate text editing. Maybe even a little bit less about carefulness, which I think is really important today if you're a programmer. I think it'll be a lot more fun.
- What do you guys think?
- I agree. I'm very excited to be able to change things. One thing that happened recently was we wanted to do a relatively big migration of our code base. We were using AsyncLocalStorage in Node.js, which is known to be not very performant, and we wanted to migrate to a context object. And this is a big migration that affects the entire code base. Sualeh and I spent, I don't know, five days working through this, even with today's AI tools. And I am really excited for a future where I can just show a couple of examples and then the AI applies that to all of the locations, and then it highlights, "Oh, this is a new example, what should I do?" And then I show exactly what to do there, and then that can be done in 10 minutes. And then you can iterate much, much faster. Then you don't have to think as much upfront and stand at the blackboard and think, "Exactly how are we gonna do this, because the cost is so high?" You can just try something first, and you realize, "Oh, this is not actually exactly what I want," and then you can change it instantly again after. And so, yeah, I think being a programmer in the future is going to be a lot of fun.
- Yeah, I really like that point. It feels like a lot of the time with programming, there are two ways you can go about it. One is you think really hard, carefully, upfront about the best possible way to do it, and then you spend your limited engineering time actually implementing it. But I much prefer just getting in the code, taking a crack at it, seeing how it lays out, and then iterating really quickly on that. That feels more fun.
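The AsyncLocalStorage-to-context-object migration Arvid describes is a Node.js-specific change, but the shape of it translates. As a hedged illustration only, here is what the before and after might look like in Python, where contextvars plays roughly the role AsyncLocalStorage plays in Node; none of this is Cursor's actual code.

```python
import contextvars
from dataclasses import dataclass

# Before: implicit ambient state, the rough Python analogue of AsyncLocalStorage.
_request_id = contextvars.ContextVar("request_id", default="unknown")

def log_with_ambient_state(message: str) -> None:
    print(f"[{_request_id.get()}] {message}")

# After: an explicit context object threaded through every call site.
@dataclass
class RequestContext:
    request_id: str

def log_with_context(ctx: RequestContext, message: str) -> None:
    print(f"[{ctx.request_id}] {message}")

# The repetitive part of the migration is rewriting every caller to accept and pass ctx,
# which is exactly the "show a couple of examples, apply it everywhere" work
# the conversation imagines handing to the AI.
ctx = RequestContext(request_id="req-42")
log_with_context(ctx, "handled request")
```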
- Yeah, and just to speak to it, generating the boilerplate is great, so you can just focus on the nuanced, difficult design decisions. Migration, I feel like this is a cool one. It seems like larger language models are able to basically translate from one programming language to another, or translate, migrate in the general sense of what migrating is. But that's in the current moment. So I mean, the fear has to do with, okay, as these models get better and better, you're making fewer and fewer creative decisions. And is it going to kind of move to a place where you're operating in the design space of natural language, where natural language is the main programming language? And, I guess, I could ask that by way of advice: if somebody's interested in programming now, what do you think they should learn? You guys started in some Java. (group chuckling) And I forget, oh, some PHP.
- PHP.
- Objective-C.
- Objective-C, there you go. I mean, in the end, we all know JavaScript was going to win, (group chuckling) and not TypeScript. It's going to be like vanilla JavaScript. It's just going to eat the world, and maybe live with PHP. And I mean, it also brings up the question of, I think Don Knuth has this idea that some percent of the population is geeks, and there's a particular kind of psychology and mind required for programming. And it feels like, more and more, the kind of person who can do great programming might expand.
- I think different people do programming for different reasons. But I think the true, maybe the best programmers are the ones that really love, just absolutely love programming. For example, there are folks on our team who, literally, when they get back from work, they go and boot up Cursor, and then they start coding on their side projects for the entire night, and they stay up until 3:00 am doing that. And when they're sad, they say, "I just really need to code." (group chuckling) And I think there's that level of programmer, where this obsession and love of programming, I think, really makes the best programmers. And I think these types of people will really get into the details of how things work.
- I guess the question I'm asking, that exact programmer, let's think about that person. When the super Tab, the super awesome, praise-be-the-Tab, succeeds, and you keep pressing Tab.
- That person on the team loves Cursor Tab more than anybody else, right?
- Yeah. Pressing Tab is just pressing Tab. That's the easy way to say it in the catchphrase. But what you're actually doing when you're pressing Tab is that you're injecting intent all the time while you're doing it. Sometimes you're rejecting it, sometimes you're typing a few more characters, and that's the way that you're shaping the thing that's being created. And I think programming will change a lot to just, "What is it that you want to make?"
- It's sort of higher bandwidth. The communication with the computer just becomes higher and higher bandwidth, as opposed to just typing, which is a much lower-bandwidth way of communicating intent.
- I mean, this goes to your manifesto, titled Engineering Genius: "We are an applied research lab building extraordinarily productive human-AI systems." So, speaking to this hybrid element.
"To start, we're building the engineer of the future, a human-AI programmer that's an order of magnitude more effective than any one engineer. This hybrid engineer will have effortless control over their code base and no low-entropy keystrokes. They will iterate at the speed of their judgment, even in the most complex systems. Using a combination of AI and human ingenuity, they will out-smart and out-engineer the best pure AI systems. We are a group of researchers and engineers. We build software and models to invent at the edge of what's useful and what's possible. Our work has already improved the lives of hundreds of thousands of programmers." And on the way to that, we'll at least make programming more fun. So, thank you for talking today.
- Thank you.
- Thanks for having us.
- Thank you.
- Thank you.
- Thanks for listening to this conversation with Michael, Sualeh, Arvid, and Aman. To support this podcast, please check out our sponsors in the description. And now, let me leave you with a random, funny, and perhaps profound programming quote I saw on Reddit: "Nothing is as permanent as a temporary solution that works." Thank you for listening, and hope to see you next time.