Xpeng’s head of autonomous driving told Electrek that the company is spending roughly 300 million RMB (~$41 million) per month on AI training alone and believes it has already reached parity with Tesla’s FSD v13 — with v14 within reach before the end of summer.
I sat down with Dr. Xianming Liu, head of Xpeng’s General Intelligence Center, the day after he delivered a keynote at CVPR 2026 in Denver — sharing the stage with Tesla’s Ashok Elluswamy and leaders from Nvidia and Waymo at one of the world’s top computer vision conferences.
The conversation covered Xpeng’s VLA 2.0 architecture, the company’s sensor strategy, its Volkswagen licensing deal, and why Dr. Liu believes the entire autonomous driving industry needs to stop treating language models as the answer to self-driving.
Here’s the full interview:
‘Language is poison’
Dr. Liu has become known for a provocative statement: “language is poison” when it comes to autonomous driving. In our interview, he explained the nuance behind that headline.
Xpeng’s first-generation VLA (Vision-Language-Action) model used language tokens as an intermediate step — the system would see the road, translate what it saw into language-like representations, then convert those into driving actions. VLA 2.0, which I tested in Beijing in April and found comparable to Tesla’s FSD v14, removes that middle step entirely.
But Liu clarified that Xpeng hasn’t abandoned language completely. The system still accepts language as input — text prompts and instructions from the driver. What it removes is language as an intermediate output during the actual driving process.
“We still utilize languages as input, so this is a key to improve the generalizability,” Liu said. “You talk to the vehicle and you give some instructions. The vehicle needs to understand how to execute these instructions. But during the driving, we don’t output any language tokens because this is a redundancy or a bottleneck of the model.”
The reasoning is straightforward: the car ingests roughly two billion visual tokens per second from its cameras but only needs maybe 10 or 20 tokens to control the steering wheel and pedals. That’s a massive dimensionality reduction, and adding a language translation step in between just introduces unnecessary computation and latency.
“In order to get a language expression, you need to generate a lot of unnecessary computation to explain it. So that’s why we remove the language as intermediate layers, but we still keep the language as input,” he explained.
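To put the scale of that reduction in perspective, here is a quick back-of-the-envelope calculation using the figures Liu cited. It is only a sketch: it assumes the "10 or 20 tokens" are counted over the same one-second window as the visual input, which the interview does not specify, and the variable names are purely illustrative.

```python
# Figures quoted in the interview. Treating the action tokens as
# "per second" is an assumption made here for illustration only.
VISUAL_TOKENS_PER_SEC = 2_000_000_000  # "roughly two billion"
ACTION_TOKENS_LOW = 10
ACTION_TOKENS_HIGH = 20


def compression_ratio(inputs: int, outputs: int) -> float:
    """Return how many input tokens map onto each output token."""
    return inputs / outputs


# Smallest ratio uses the larger output count, and vice versa.
ratio_min = compression_ratio(VISUAL_TOKENS_PER_SEC, ACTION_TOKENS_HIGH)
ratio_max = compression_ratio(VISUAL_TOKENS_PER_SEC, ACTION_TOKENS_LOW)

print(f"{ratio_min:.0e} to {ratio_max:.0e} visual tokens per action token")
```

Under that assumption, the model is squeezing somewhere between 100 million and 200 million visual tokens into each control token, which is the gap Liu argues a language step only widens.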
World model: the next piece of the puzzle
Liu used his CVPR keynote to unveil something new — Xpeng’s world model, which he frames not as a separate technology from VLA but as “the other side of the same problem.”
VLA 2.0 learns from human driving behavior — it studies how millions of drivers react in real situations and learns to replicate those decisions. The world model learns the physics of the environment — it predicts what will happen next in the scene, how other agents will move, what the consequences of any given action will be.
“People trying to separate the two concepts, world model and VLA, as two dimensions of the technology, but actually they are the same,” Liu said. “Our goal is trying to build a foundation model powerful enough to understand the world.”
The practical application: Xpeng is now training VLA 2.0 to simultaneously predict what the cameras will see in the near future and what the car should do — combining driving and world prediction in one model. The company expects to deploy this upgrade to production vehicles later this year.
Xpeng has published a series of research papers backing this work, including X-World for controllable video generation, X-Foresight for joint future prediction and planning, and X-Cache, which cuts world model computation by 70% with negligible quality loss. The company also had its “DrivePTS” paper on driving scene generation accepted at CVPR 2026.
Radar stays for active safety, not for driving
One detail that often gets lost in Xpeng’s “pure vision” marketing: the P7+, G7, and other recent models still carry three millimeter-wave radars and twelve ultrasonic sensors alongside their cameras. I asked Liu how those fit into an end-to-end architecture.
His answer was clear: they don’t feed into the main driving AI at all.
“We do utilize these other sensors, but they are utilized for the active safety system, which requires an orthogonal system to be totally redundant with the main driving system,” Liu said. The radar and ultrasonics power AEB (automatic emergency braking) and AES (automatic emergency steering) — a completely separate safety layer.
The main driving system is vision-only. Liu’s reasoning comes down to information density and latency: “Camera readout time is a couple milliseconds, pretty quick, and the frequency could be very high. From the information density perspective, camera is one of the best sensors. If you’re using LiDAR, radars, the processing time is pretty long, usually tens or hundreds of milliseconds.”
This puts Xpeng in an interesting middle ground. Tesla famously removed radar and ultrasonic sensors entirely from its vehicles, relying solely on cameras for everything, including active safety. Waymo goes in the opposite direction with a full LiDAR suite. Xpeng uses cameras exclusively for the driving brain but keeps radar as a separate safety net.
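One way to picture the split Liu describes is as two independent paths joined by a simple arbiter. The following is a minimal sketch and not Xpeng's implementation: every name, the time-to-collision rule, and the 1.5-second threshold are hypothetical, chosen only to show how a radar-only safety layer can override a camera-only planner without sharing any inputs with it.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Command:
    steering: float  # radians (illustrative convention)
    accel: float     # m/s^2, negative means braking
    source: str      # which subsystem produced the command


def vision_planner(camera_features: list) -> Command:
    """Stand-in for the camera-only driving model (hypothetical)."""
    # A real system would run a learned model; this returns a fixed cruise command.
    return Command(steering=0.0, accel=0.5, source="vision")


def safety_monitor(radar_range_m: float, closing_speed_mps: float,
                   ttc_threshold_s: float = 1.5) -> Optional[Command]:
    """Independent check that uses radar only and never sees camera data."""
    if closing_speed_mps <= 0:
        return None  # not closing on the object ahead
    time_to_collision = radar_range_m / closing_speed_mps
    if time_to_collision < ttc_threshold_s:
        # Hypothetical emergency braking command.
        return Command(steering=0.0, accel=-8.0, source="aeb")
    return None


def arbitrate(planned: Command, override: Optional[Command]) -> Command:
    """The safety layer wins whenever it fires."""
    return override if override is not None else planned


# Example: an object 10 m ahead closing at 10 m/s gives a 1.0 s time to collision.
final = arbitrate(vision_planner([]), safety_monitor(10.0, 10.0))
print(final.source)
```

The point of the design is that the two functions share no inputs, so a failure in the vision path cannot silently disable the braking path.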
When I asked if the vision system could eventually get so good that the redundant safety layer becomes unnecessary, Liu was blunt: “We hope so, but to be honest, it’s not possible. We all make mistakes. System also makes mistakes. Even you can reach 99.9999%, you still have the chance to make mistakes. Adding another layer of redundancy definitely will help.”
He added: “You’re not talking about chatting with ChatGPT and making a mistake — you just say, ‘hey, this is so silly, redo.’ You’re talking about lives.”
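Liu's argument for redundancy can be put in rough numbers. The sketch below assumes the backup layer's misses are statistically independent of the main system's failures, which is what an "orthogonal" design aims for but which the interview does not quantify. The 99% catch rate for the backup is a hypothetical figure; only the 99.9999% comes from Liu's quote.

```python
def residual_failure_rate(primary_success: float, backup_catch_rate: float) -> float:
    """Probability that a primary failure also slips past an independent backup."""
    primary_failure = 1.0 - primary_success
    return primary_failure * (1.0 - backup_catch_rate)


# 99.9999% is the figure Liu used; the 99% backup catch rate is assumed.
primary_only = residual_failure_rate(0.999999, 0.0)
with_backup = residual_failure_rate(0.999999, 0.99)

print(f"primary only: {primary_only:.1e}, with backup: {with_backup:.1e}")
```

Under those assumptions the chance of an uncaught failure drops from about one in a million to about one in a hundred million, which is the sense in which another layer "definitely will help" even when the main system is already very good.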
300 million RMB a month on AI training
I asked Liu about the scale of Xpeng’s investment in autonomous driving. His answer was startling for a company that delivered roughly 200,000 vehicles last year.
“There’s a lot of jokes on the internet saying I’m asking a lot of budget from the boss,” Liu said. “He said something like 300 million a month RMB for me to train the model. That’s roughly the truth. I do spend a lot of money.”
That’s roughly $41 million per month, or close to $500 million a year, on AI model training alone, a significant figure for a company with RMB 47.66 billion ($6.5 billion) in cash at the end of 2025. Liu acknowledged this is unusual for a car company: “As a car company, you can never imagine such a big R&D investment because you can never make the money back. But our company is determined to be a Physical AI company.”
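For readers who want to check the math, the figures above can be reproduced in a few lines. This is a sketch that uses only the numbers quoted in this article; the exchange rate is the one implied by those numbers, not an official rate, and the variable names are illustrative.

```python
# Figures quoted in the article.
MONTHLY_SPEND_RMB = 300e6   # "300 million a month RMB"
MONTHLY_SPEND_USD = 41e6    # article's dollar conversion
CASH_RMB = 47.66e9          # cash position at the end of 2025

# Exchange rate implied by the article's own conversion (RMB per USD).
implied_rate = MONTHLY_SPEND_RMB / MONTHLY_SPEND_USD

# Annualized spend, assuming the monthly figure holds for twelve months.
annual_rmb = MONTHLY_SPEND_RMB * 12
annual_usd = MONTHLY_SPEND_USD * 12

# Share of the year-end cash balance that one year of training would consume.
share_of_cash = annual_rmb / CASH_RMB

print(f"implied rate: {implied_rate:.2f} RMB/USD")
print(f"annual spend: {annual_rmb / 1e9:.1f}B RMB, ${annual_usd / 1e6:.0f}M")
print(f"share of cash: {share_of_cash:.1%}")
```

That works out to about $492 million a year, or roughly 7.6% of the cash balance, assuming the monthly run rate stays constant.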
Xpeng disclosed at CVPR that its training infrastructure has achieved a 4,360% gain in single-job training efficiency over the past 12 months, with GPU utilization climbing from 40% to 90%. VLA 2.0 uses billions of parameters and consumes over four trillion tokens per model iteration.
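Percentage gains this large are easier to read as multipliers. The sketch below converts the disclosed figures, assuming "4,360% gain" means an increase on top of the baseline. If Xpeng instead meant 4,360% of the baseline, the multiplier would be 43.6 rather than 44.6. The helper name is illustrative.

```python
def gain_to_multiplier(percent_gain: float) -> float:
    """Convert a percentage increase over a baseline into a multiplier."""
    return 1.0 + percent_gain / 100.0


# "4,360% gain in single-job training efficiency" read as an increase.
efficiency_multiplier = gain_to_multiplier(4360)

# GPU utilization climbing from 40% to 90%.
utilization_multiplier = 0.90 / 0.40

print(f"training efficiency: {efficiency_multiplier:.1f}x")
print(f"GPU utilization: {utilization_multiplier:.2f}x")
```

Read that way, the efficiency figure is about a 44.6-fold improvement, while the utilization jump on its own is a 2.25-fold change.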
Tesla comparison: ‘same philosophy, different data’
Liu was diplomatic but specific when comparing Xpeng’s approach to Tesla’s FSD.
“I think we share similar underlying philosophy and principles, which is scaling up,” he said. “No matter Tesla or Xpeng or other companies working on this trajectory are working on the same thing — just following the scaling law, make sure you have a system which is data-driven and you can feed unlimited data into the model.”
The key difference, according to Liu, is data diversity. Chinese roads are significantly more chaotic than American ones — a point I experienced firsthand during my 40-minute VLA 2.0 test drive in Beijing, where I encountered more edge cases than I’d see in weeks of driving in North America.
“In China you have a larger chance to get corner cases and data. This is one advantage,” Liu said. He argued this could give Xpeng better odds of expanding internationally than Tesla has of bringing FSD to China: not that it is easier per se, but “you have more chances because you have more diverse data.”
The Golden Gate Bridge bet
Xpeng CEO He Xiaopeng made a public bet with Liu last year: if VLA 2.0 doesn’t match Tesla FSD performance by August 30, 2026, Liu has to streak naked across the Golden Gate Bridge.
Liu says he’s safe. “I’m pretty confident I don’t need to run,” he told me. “The condition is reaching parity with Tesla FSD in the beginning of this year. Based on test drives, we already hit the goal.”
He claims Xpeng went from FSD v12 parity to “almost v14 or even better than v13” in just a few months, crediting the team’s rapid iteration speed. The August deadline still stands, but Liu seems relaxed about it.
Google’s Pixel analogy and the Volkswagen deal
Perhaps the most revealing moment came when Liu described Xpeng’s identity. He compared the company to Google making Pixel phones — hardware exists primarily to demonstrate and collect data for the software.
“Producing cars or manufacturing cars definitely is one of the main reasons we are working now,” he said. “We need physical devices in the real world to make sure we get feedback, we get data. Just like Google is producing Pixel devices just trying to demonstrate, ‘OK, this is what Android can do.’ But on the other hand, we want to make sure we are an AI company.”
This framing puts the Volkswagen VLA 2.0 licensing deal in context. Volkswagen became the first external customer for VLA 2.0 earlier this year, with deployment planned for 2027. Liu downplayed the technical challenge of porting the system to VW’s vehicles, noting that Xpeng already pushes OTA updates to more than 20 different car models internally.
“Adding one or two cars for us, it’s not something novel or something new. You train a model and if you generalize it to 20 cars, it doesn’t matter if you have 21, 22, 23.”
The bigger goal, Liu suggested, is getting the entire industry on board: “If only Xpeng or Tesla working on that, it will never become true. You need a lot of partners, you need a lot of friends, you need everyone to accept the truth that automation is coming.”
Electrek’s Take
This interview confirmed what I suspected after my VLA 2.0 test drive in April: Xpeng is running a legitimate autonomous driving program that quickly became genuinely competitive with Tesla’s FSD. It is doing so on an AI training budget of roughly 300 million RMB per month, which might sound like a lot but isn’t much in the grand scheme of AI spending.
What struck me most was Liu’s clarity about architecture decisions. The “language is poison” framing sounds like a hot take, but his explanation is technically sound — converting continuous visual signals into discrete language tokens and back is inefficient for a real-time physical control system. It’s a different bet than what most of the industry is making with large language models, and it’s one that VLA 2.0’s on-road performance is starting to validate.
The Pixel analogy is also telling. Xpeng is signaling that it sees vehicle manufacturing as the means, not the end. With Volkswagen already licensing VLA 2.0 and Xpeng reportedly in talks to buy a VW plant in Europe, the company is positioning itself as both a car maker and an autonomous driving technology supplier, essentially hedging on which business ultimately generates more value.
To me, it is becoming clear that they are aiming to be a ‘physical AI’ company more than an automaker at this point.