JLama: The First Pure Java Model Inference Engine Implemented With Vector API and Project Panama
Briefly

Andrej Karpathy's decision to open-source llama2.c, a roughly 700-line C inference engine, demystified how developers can interact with LLMs. The public repository took off, gathering thousands of stars, forks, and ports to other languages.
JLama is the first pure Java inference engine available in Maven Central. The implementation leverages the Vector API through its PanamaTensorOperations class, with a native fallback.
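Since the library is published to Maven Central, it can be pulled in as an ordinary dependency. A sketch of the coordinates follows; the group and artifact IDs match the project's README at the time of writing, but the version shown is illustrative and should be replaced with the latest release:

```xml
<!-- Illustrative coordinates; check Maven Central for the current version -->
<dependency>
    <groupId>com.github.tjake</groupId>
    <artifactId>jlama-core</artifactId>
    <version>0.1.0</version>
</dependency>
```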
Released under the Apache License, the project is built with Java 21 and the new Vector API, which promises faster inference. JLama implements features such as distributed inference, flash attention, mixture of experts, and support for the Hugging Face SafeTensors model and tokenizer formats.
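JLama's PanamaTensorOperations class builds on the incubating Vector API. As a minimal illustration of the idea (not JLama's actual code), the dot product at the heart of matrix-vector multiplication in inference can be vectorized like this; it requires running with `--add-modules jdk.incubator.vector` on Java 21:

```java
import jdk.incubator.vector.FloatVector;
import jdk.incubator.vector.VectorOperators;
import jdk.incubator.vector.VectorSpecies;

public class DotProduct {
    // Pick the widest SIMD shape the current CPU supports
    static final VectorSpecies<Float> SPECIES = FloatVector.SPECIES_PREFERRED;

    static float dot(float[] a, float[] b) {
        float sum = 0f;
        int i = 0;
        int upper = SPECIES.loopBound(a.length);
        // Process SPECIES.length() floats per iteration
        for (; i < upper; i += SPECIES.length()) {
            FloatVector va = FloatVector.fromArray(SPECIES, a, i);
            FloatVector vb = FloatVector.fromArray(SPECIES, b, i);
            sum += va.mul(vb).reduceLanes(VectorOperators.ADD);
        }
        // Scalar tail for the remaining elements
        for (; i < a.length; i++) {
            sum += a[i] * b[i];
        }
        return sum;
    }

    public static void main(String[] args) {
        float[] a = {1f, 2f, 3f, 4f, 5f};
        float[] b = {1f, 1f, 1f, 1f, 1f};
        System.out.println(dot(a, b)); // 15.0
    }
}
```

The same loop written with scalar floats forces the JIT to guess at auto-vectorization; the Vector API makes the SIMD shape explicit, which is what enables the speedups the project targets.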
For those who want "to just chat with a large language model," the library provides a simple web UI accessible at http://localhost:8080/ui/index.html.
Read at InfoQ