Technology

42816 readers
73 users here now

This is the official technology community of Lemmy.ml for all news related to creation and use of technology, and to facilitate civil, meaningful discussion around it.


Ask in DM before posting product reviews or ads. All such posts otherwise are subject to removal.


Rules:

1: All Lemmy rules apply

2: Do not post low effort posts

3: NEVER post naziped*gore stuff

4: Always post article URLs or their archived version URLs as sources, NOT screenshots. Help the blind users.

5: personal rants of Big Tech CEOs like Elon Musk are unwelcome (does not include posts about their companies affecting wide range of people)

6: no advertisement posts unless verified as legitimate and non-exploitative/non-consumerist

7: crypto related posts, unless essential, are disallowed

founded 7 years ago

DSpark is genuinely one of the more elegant solutions to the speculative decoding bottleneck I have seen lately. The core issue here is the tradeoff between draft speed and draft quality. Autoregressive drafters like Eagle3 get great acceptance rates, but their latency scales linearly with block size, which forces you to use short drafts. Parallel drafters like DFlash, on the other hand, fix the speed issue by doing everything in one forward pass, but they fall apart on longer sequences because the tokens are predicted independently. If the model is unsure whether to output "of course" or "no problem", a parallel drafter might hallucinate "of problem" because it's not aware of its own previous token choices, which causes a massive dropoff in acceptance rates for later tokens in a block.
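To make the "of problem" failure concrete, here is a toy sketch in Python. The two-phrase distribution and its 50/50 probabilities are made up for illustration and the function names are hypothetical; it only shows how much probability mass an independent per-position drafter puts on token pairs the target model would never produce.

```python
from collections import defaultdict
from itertools import product

# Toy target distribution over two-token continuations (made-up numbers).
joint = {("of", "course"): 0.5, ("no", "problem"): 0.5}


def marginals(joint):
    """Per-position token probabilities, which is all an independent drafter sees."""
    first, second = defaultdict(float), defaultdict(float)
    for (a, b), p in joint.items():
        first[a] += p
        second[b] += p
    return dict(first), dict(second)


def independent_draft(joint):
    """Distribution of a drafter that samples each position independently."""
    first, second = marginals(joint)
    return {(a, b): first[a] * second[b] for a, b in product(first, second)}


def invalid_mass(joint):
    """Probability that the independent draft is a pair the target never emits."""
    draft = independent_draft(joint)
    return sum(p for pair, p in draft.items() if joint.get(pair, 0.0) == 0.0)


print(invalid_mass(joint))  # 0.5
```

In this toy case half of all two-token drafts are "of problem" or "no course", so the second token gets rejected about half the time even though each position looks perfectly calibrated on its own.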

DSpark uses a semi-autoregressive approach: it does the bulk of the computation in parallel and then attaches a lightweight sequential head to handle local token transitions. Combining the two approaches solves the cross-mode collision issue and keeps the conditional acceptance rate high all the way to the end of a long draft block while adding basically zero latency overhead. A shallow two-layer DSpark model actually outperformed a heavier five-layer DFlash baseline, simply because the sequential modeling is so much more parameter efficient.
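A rough way to picture the semi-autoregressive idea, as a toy Python sketch. This is not DSpark's actual architecture: the scores, the `TRANSITION` table, and the `draft` function are all hypothetical stand-ins. The per-position scores play the role of the expensive parallel pass, and a cheap lookup on the previously chosen token plays the role of the sequential head.

```python
# Per-position scores as a parallel pass might produce them (made-up numbers).
# Taken independently, position 0 prefers "of" and position 1 prefers "problem".
PARALLEL_SCORES = [
    {"of": 0.55, "no": 0.45},
    {"course": 0.45, "problem": 0.55},
]

# Hypothetical transition bonuses standing in for a lightweight sequential head.
TRANSITION = {("of", "course"): 1.0, ("no", "problem"): 1.0}


def draft(scores, transition):
    """Greedy draft: parallel scores plus a cheap bonus tied to the previous token."""
    tokens = []
    prev = None
    for pos_scores in scores:
        def combined(tok):
            bonus = transition.get((prev, tok), 0.0) if prev is not None else 0.0
            return pos_scores[tok] + bonus

        best = max(pos_scores, key=combined)
        tokens.append(best)
        prev = best
    return tokens


print(draft(PARALLEL_SCORES, {}))          # ['of', 'problem']
print(draft(PARALLEL_SCORES, TRANSITION))  # ['of', 'course']
```

The sequential step here is only a dictionary lookup per token, which is the intuition behind the head adding very little latency next to the parallel pass.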

The second major finding addresses the system-level bottlenecks of serving these models in production. Even if you can draft a ton of tokens quickly, verifying a massive block of low-confidence text wastes GPU cycles. DSpark introduces a confidence head that estimates the survival probability of each draft token and pairs it with a hardware-aware scheduler. Instead of blindly verifying a fixed block length, the scheduler looks at the current load and the confidence scores and dynamically truncates the draft. Load-adaptive scheduling allowed DeepSeek to boost per-user generation speeds by roughly 60 to 85% at matched throughput compared to their previous production baseline. They also open sourced the training repo and the checkpoints, which is a huge win for the community.
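The truncation logic could look something like the Python sketch below. The rule is an assumption on my part, not the actual scheduler: I treat the confidence head's outputs as per-token acceptance estimates, take the running product as the survival probability, and cut the draft once that drops below a threshold that rises with load. `truncate_draft`, `base_threshold`, and `max_threshold` are hypothetical names and values.

```python
def truncate_draft(confidences, load, base_threshold=0.3, max_threshold=0.8):
    """Return how many draft tokens to send for verification.

    confidences: estimated acceptance probability of each draft token,
                 given that all earlier tokens were accepted (assumed input).
    load:        current system load in [0, 1]; higher load means stricter cuts.
    """
    threshold = base_threshold + (max_threshold - base_threshold) * load
    survival = 1.0
    keep = 0
    for c in confidences:
        survival *= c  # chance that the draft is still alive at this token
        if survival < threshold:
            break
        keep += 1
    return keep


scores = [0.9, 0.9, 0.8, 0.6, 0.5]
print(truncate_draft(scores, load=0.0))  # 4
print(truncate_draft(scores, load=0.5))  # 3
print(truncate_draft(scores, load=1.0))  # 2
```

On an idle system it is worth gambling on a longer block, while under heavy load the same draft gets cut to the tokens that are most likely to survive verification.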

When the Moat Becomes a Cage (dialecticaldispatches.substack.com)
submitted 1 day ago by yogthos@lemmy.ml to c/technology@lemmy.ml

The absolute legends from pro-Palestine campaign group Your Tech Their Deaths (YTTD) have disrupted a Dublin conference hosted by the disgraced Microsoft, a company that has operated as the tech backbone of so-called ‘Israel’s’ holocaust in Gaza.

Activists at YTTD rightly aren’t prepared to tolerate this corporate abomination operating freely in Ireland. As a Microsoft employee was giving a presentation, an activist interrupted proceedings while holding a Palestinian flag. She said:

One of the customers of Microsoft is the Israeli military. Microsoft uses their computing AI Azure to help Israeli military target and surveil Palestinians. Just like IBM was helping [the Nazis] to target the Jews, Microsoft is helping to target Palestinians in Palestine and Gaza. Microsoft is complicit with their cloud and their technology in killing Palestinians.

When a man asked her to stop, she replied: “Then you need to stop your contracts”. The activist urged Microsoft employees to join No Azure for Apartheid, as a second activist joined in.

The action was a success, causing a complete halt to the Microsoft presentation. Nearly all audience members left the room. The Canary spoke to YTTD founder Jude Farrell, who suggested that maybe “they were embarrassed to be there in the first place”. When the talk restarted, other activists who had previously remained silent then disrupted proceedings once more.


DualPath is a system developed by DeepSeek to address the storage I/O bottleneck that slows down agentic LLM inference. When LLMs run as agents, they need to repeatedly interact with their environments over many turns, which builds up a massive context history stored as a KV-Cache. Most current systems split the workload into prefill engines that process new prompt tokens and decode engines that generate the actual responses. The fundamental issue is that prefill engines have to load KV-Cache directly from external persistent storage, which maxes out the storage network bandwidth on the prefill side while the storage network connections on the decode engines sit idle.

DualPath creates a second route for the data, which allows the system to load KV-Cache from storage into the idle decode engines first. Once the data hits the decode engines, it gets forwarded to the prefill engines over the fast compute network connecting the GPUs. It's basically a routing strategy for aggregating the storage bandwidth across all the machines and stopping the prefill nodes from becoming a choke point.

A traffic manager places the KV-Cache transfers onto a lower-priority virtual lane so that the actual inference communication gets priority on the bandwidth while data shuffling happens in the background without causing latency spikes. A dynamic scheduler then constantly monitors token counts and queue lengths to distribute the reading tasks evenly across all available hardware. In tests, DualPath improved system throughput by nearly two times compared to a standard setup. Turns out that properly balancing network traffic across bandwidth that was already available in the cluster makes multi-turn agent workloads dramatically faster.
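A minimal Python sketch of the load-balancing idea, under my own simplifying assumptions: each node is described only by how many bytes are already queued on its storage link, and every KV-Cache read goes to the least-loaded node. The real scheduler reportedly looks at token counts and queue lengths; `assign_reads` and the `direct` / `via_decode` labels are hypothetical.

```python
import heapq


def assign_reads(requests, nodes):
    """Greedily spread KV-Cache reads across prefill and decode storage links.

    requests: list of (request_id, kv_bytes)
    nodes:    list of (name, role, queued_bytes), role is "prefill" or "decode"
    Returns a list of (request_id, node_name, path).
    """
    heap = [(queued, name, role) for name, role, queued in nodes]
    heapq.heapify(heap)
    plan = []
    # Place the largest reads first so the smaller ones can even things out.
    for req_id, size in sorted(requests, key=lambda r: -r[1]):
        queued, name, role = heapq.heappop(heap)  # least-loaded storage link
        # A read landing on a decode node needs one extra hop over the
        # compute network to reach the prefill engine.
        path = "direct" if role == "prefill" else "via_decode"
        plan.append((req_id, name, path))
        heapq.heappush(heap, (queued + size, name, role))
    return plan


nodes = [("p0", "prefill", 0), ("d0", "decode", 0)]
requests = [("a", 4), ("b", 3), ("c", 2)]
for entry in assign_reads(requests, nodes):
    print(entry)
```

With only the prefill node available, all 9 units would queue on one storage link. Here the busiest link ends up with 5, which is the bandwidth aggregation effect in miniature.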
