Sorry, this is my first post on any forum.
I’m a rather shy person, but I’m willing to talk about my project. I just need to know whether I’m doing things right or not, because I haven’t received any feedback from people yet.
——————————————————————————————-
Hi all, my name is siritoriyowai, and I’m the creator of the Serenade language (it’s currently private, so this post isn’t an advertisement).
Let’s get straight to the point.
The language itself offers the ease of Python but with the power of C++, Assembly, CUDA, and Go.
The language natively supports:
- search-related AI
- a game engine on a basic OpenGL compatibility profile
- almost all basic algorithms (bubble sort, calculators, etc.)
——————————————————————————————-
I’m using Claude and ChatGPT to help write this language, because the whole point is incredible speed and I want to bring the idea to life as soon as possible.
Recently I’ve been gripped by a real desire to beat the cuBLAS library at GEMM (matrix multiplication).
I’m genuinely trying to pull this off, and I’ve been keeping logs and personal notes of the best runs. The kernel architecture itself is built on WMMA tensor cores, a cp.async pipeline, warp tiling, double buffering, and shared memory padding.
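For anyone who hasn’t touched WMMA before, here is a minimal CUDA sketch of just the fragment part, where each warp owns one 16x16 tile of C and accumulates in fp32. This is not the Serenade kernel itself (no cp.async staging, no warp tiling, no double buffering, no padded shared memory), just the core tensor-core loop everything else is built around; the matrix layouts and the launch shape in the comments are my own assumptions.
'''
#include <mma.h>
#include <cuda_fp16.h>
using namespace nvcuda;

// Minimal WMMA GEMM sketch: C (fp32) += A (fp16, row-major) * B (fp16, col-major).
// One warp computes one 16x16 tile of C. Assumes M, N, K are multiples of 16.
// Hypothetical launch: dim3 block(128, 4); dim3 grid(M / (16 * 4), N / (16 * 4));
__global__ void wmma_gemm_naive(const half* A, const half* B, float* C,
                                int M, int N, int K) {
    // Which 16x16 output tile this warp owns.
    int warpM = (blockIdx.x * blockDim.x + threadIdx.x) / warpSize;
    int warpN = blockIdx.y * blockDim.y + threadIdx.y;
    if (warpM * 16 >= M || warpN * 16 >= N) return;

    wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> a_frag;
    wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::col_major> b_frag;
    wmma::fragment<wmma::accumulator, 16, 16, 16, float> acc_frag;
    wmma::fill_fragment(acc_frag, 0.0f);

    // Walk the K dimension 16 at a time, one tensor-core mma per step.
    for (int k = 0; k < K; k += 16) {
        wmma::load_matrix_sync(a_frag, A + warpM * 16 * K + k, K);  // lda = K (row-major)
        wmma::load_matrix_sync(b_frag, B + warpN * 16 * K + k, K);  // ldb = K (col-major)
        wmma::mma_sync(acc_frag, a_frag, b_frag, acc_frag);
    }
    wmma::store_matrix_sync(C + warpM * 16 * N + warpN * 16, acc_frag, N,
                            wmma::mem_row_major);
}
'''
On top of something like this, the real kernel stages tiles of A and B into shared memory with cp.async so the copies for the next K-step overlap the math for the current one (double buffering), and pads the shared memory rows by a few elements so the tile loads don’t hit bank conflicts.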
——————————————————————————————-
There are 40 versions in total, and the current result is 34.6 TFLOPS (if I’ve calculated correctly, that’s near the limit for an ordinary home graphics card without overclocking; CUDA code using cuBLAS gave me around 35.8 TFLOPS, specifically with fp16 inputs and fp32 accumulation).
Here’s the log for the best version:
'''
v13 BEST PEAK
Fixed the fake cudaErrorInvalidValue source: removed cudaDeviceSetLimit(cudaLimitPersistingL2CacheSize) from init, since it likely failed and poisoned the error state, and replaced it with a cudaGetLastError() cleanup so the stale error is cleared.
Big compute-side win: preloaded WMMA fragments. Load the A fragments once per wm and the B fragments once per wn, then do the 8 mma ops without reloading fragments on every inner loop, which cuts smem-to-register traffic hard.
Result is the best so far, 35.7 TFLOPS, and no more cudaErrorInvalidValue line. Finally clean.

ptxas info : Used 106 registers, used 1 barriers, 20992 bytes smem, 392 bytes cmem[0]
[gpu gemm] 8192x8192x8192 | serenaCore | 40 runs
[gpu gemm] avg=30.831 ms 35662.2 GFLOPS
c[0]=8192.000000
'''
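A quick sanity check on those numbers, for anyone who wants to verify them: a GEMM does 2·M·N·K floating-point operations, so at 8192x8192x8192 that’s 2 × 8192³ ≈ 1.1 × 10¹² FLOPs per run; divided by the 30.831 ms average, that works out to about 35.66 TFLOPS, which matches the 35662.2 GFLOPS line in the log.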
——————————————————————————————-
Also, a little about the syntax; I’ll take this from my old personal notes.
——————————————————————————————-
- Atomics (to protect against race conditions)
- Syntactic sugar for sizes (kb, mb, gb, evaluated immediately)
- Memory arena allocation: the two languages can share memory without conflicts, a bit like the garbage collector in Go, because if memory were shared with C++ without that care, data would simply get deleted, so the arena keeps things somewhat isolated (there’s a rough sketch of the idea right after this list)
- A native block, which lets you write (more or less) pure C++ inside the language. If there’s a function you really need but the language doesn’t have it, you can call it through the pluses and everything works fine
- elif and else (there weren’t any before, there was just a single if)
- while loops
- foreach (like cycle, but with a step)
- match and case (they work with any data type, unlike switch in C++, which only works with integral types)
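To make the arena point more concrete, here’s a rough C++ sketch of what I mean by an arena (a bump allocator): one big block is reserved up front, allocations just move a pointer inside it, and everything is released in one shot, so nobody frees memory out from under anyone else. This is only an illustration of the concept, not Serenade’s actual allocator.
'''
#include <cstddef>
#include <cstdint>
#include <cstdlib>

// Rough sketch of a bump arena: carve allocations out of one big block,
// never free individual objects, release everything at once at the end.
struct Arena {
    std::uint8_t* base;
    std::size_t   size;
    std::size_t   used = 0;

    explicit Arena(std::size_t bytes)
        : base(static_cast<std::uint8_t*>(std::malloc(bytes))), size(bytes) {}
    ~Arena() { std::free(base); }

    void* alloc(std::size_t bytes, std::size_t align = alignof(std::max_align_t)) {
        std::size_t p = (used + align - 1) & ~(align - 1);  // align the bump pointer
        if (p + bytes > size) return nullptr;               // arena exhausted
        used = p + bytes;
        return base + p;
    }

    void reset() { used = 0; }  // drop everything in one shot
};

int main() {
    Arena arena(1 << 20);                                    // 1 MB up front
    float* v = static_cast<float*>(arena.alloc(256 * sizeof(float)));
    if (v) v[0] = 1.0f;
    arena.reset();                                           // no per-object free, no double-free fights
    return 0;
}
'''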
——————————————————————————————-
And here’s what a full-fledged GEMM test looks like in the same language:
'''
sropstart()
const N = 8192
let a = @f32[N * N]
let b = @f32[N * N]
let c = @f32[N * N]
emit "filling {N}x{N}"
cycle N * N as i {
    a^[i] = 1.0
    b^[i] = 1.0
    c^[i] = 0.0
}
gpu bench c a b N N N
srcheckopr()
emit "c[0] = {c^[0]}"
'''
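And for anyone who wants to reproduce the cuBLAS number I’m comparing against, a plain CUDA C++ baseline for the same fp16-input / fp32-accumulate setup would look roughly like this. It’s a sketch only, not my actual benchmark harness: error checking is omitted, and since cuBLAS is column-major the layout doesn’t matter for a constant-fill timing run.
'''
#include <cstdio>
#include <vector>
#include <cuda_fp16.h>
#include <cuda_runtime.h>
#include <cublas_v2.h>

int main() {
    const int N = 8192;
    const size_t elems = (size_t)N * N;

    half *A, *B; float *C;
    cudaMalloc(&A, elems * sizeof(half));
    cudaMalloc(&B, elems * sizeof(half));
    cudaMalloc(&C, elems * sizeof(float));

    std::vector<half> ones(elems, __float2half(1.0f));      // same all-ones fill as the test
    cudaMemcpy(A, ones.data(), elems * sizeof(half), cudaMemcpyHostToDevice);
    cudaMemcpy(B, ones.data(), elems * sizeof(half), cudaMemcpyHostToDevice);
    cudaMemset(C, 0, elems * sizeof(float));

    cublasHandle_t handle;
    cublasCreate(&handle);
    const float alpha = 1.0f, beta = 0.0f;

    // One warm-up call so setup cost is not counted.
    cublasGemmEx(handle, CUBLAS_OP_N, CUBLAS_OP_N, N, N, N,
                 &alpha, A, CUDA_R_16F, N, B, CUDA_R_16F, N,
                 &beta,  C, CUDA_R_32F, N,
                 CUBLAS_COMPUTE_32F, CUBLAS_GEMM_DEFAULT);
    cudaDeviceSynchronize();

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    const int runs = 40;
    cudaEventRecord(start);
    for (int i = 0; i < runs; ++i) {
        cublasGemmEx(handle, CUBLAS_OP_N, CUBLAS_OP_N, N, N, N,
                     &alpha, A, CUDA_R_16F, N, B, CUDA_R_16F, N,
                     &beta,  C, CUDA_R_32F, N,
                     CUBLAS_COMPUTE_32F, CUBLAS_GEMM_DEFAULT);
    }
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    double avg_s = (ms / runs) / 1e3;
    double tflops = 2.0 * N * (double)N * N / avg_s / 1e12;  // 2*M*N*K FLOPs per GEMM
    std::printf("[cublas gemm] avg=%.3f ms  %.1f TFLOPS\n", ms / runs, tflops);

    cublasDestroy(handle);
    cudaFree(A); cudaFree(B); cudaFree(C);
    return 0;
}
'''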
Thanks to those who read and paid attention!