Japanese InstructBLIP Alpha
Model Details
Japanese InstructBLIP Alpha is a vision-language instruction-following model that generates Japanese descriptions for input images and, optionally, for accompanying input text such as questions.
Usage
First install additional dependencies in requirements.txt:
pip install sentencepiece einops
import torch
from transformers import LlamaTokenizer, AutoModelForVision2Seq, BlipImageProcessor
from PIL import Image
import requests
# helper function to format input prompts
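def build_prompt(prompt="", sep="\n\n### "):
    # System message: "The following is a combination of an instruction describing a task
    # and an input with context. Write a response that appropriately satisfies the request."
    sys_msg = "以下は、タスクを説明する指示と、文脈のある入力の組み合わせです。要求を適切に満たす応答を書きなさい。"
    p = sys_msg
    # Roles: "指示" = instruction, "応答" = response
    roles = ["指示", "応答"]
    # Default instruction: "Describe the given image in detail."
    user_query = "与えられた画像について、詳細に述べてください。"
    msgs = [": \n" + user_query, ": "]
    if prompt:
        # Optional "入力" (input) section that carries the user's text, e.g. a question
        roles.insert(1, "入力")
        msgs.insert(1, ": \n" + prompt)
    for role, msg in zip(roles, msgs):
        p += sep + role + msg
    return p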
# load model
model = AutoModelForVision2Seq.from_pretrained("stabilityai/japanese-instructblip-alpha", trust_remote_code=True)
processor = BlipImageProcessor.from_pretrained("stabilityai/japanese-instructblip-alpha")
tokenizer = LlamaTokenizer.from_pretrained("novelai/nerdstash-tokenizer-v1", additional_special_tokens=['▁▁'])
device = "cuda" if torch.cuda.is_available() else "cpu"
model.to(device)
# prepare inputs
url = "https://images.unsplash.com/photo-1582538885592-e70a5d7ab3d3?ixlib=rb-4.0.3&ixid=M3wxMjA3fDB8MHxwaG90by1wYWdlfHx8fGVufDB8fHx8fA%3D%3D&auto=format&fit=crop&w=1770&q=80"
image = Image.open(requests.get(url, stream=True).raw).convert("RGB")
prompt = "" # input empty string for image captioning. You can also input questions as prompts
prompt = build_prompt(prompt)
inputs = processor(images=image, return_tensors="pt")
text_encoding = tokenizer(prompt, add_special_tokens=False, return_tensors="pt")
text_encoding["qformer_input_ids"] = text_encoding["input_ids"].clone()
text_encoding["qformer_attention_mask"] = text_encoding["attention_mask"].clone()
inputs.update(text_encoding)
# generate
outputs = model.generate(
    **inputs.to(device, dtype=model.dtype),
    num_beams=5,
    max_new_tokens=32,
    min_length=1,
)
generated_text = tokenizer.batch_decode(outputs, skip_special_tokens=True)[0].strip()
print(generated_text)
# 桜と東京スカイツリー ("Cherry blossoms and Tokyo Skytree")
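The comment on the `prompt` line above notes that a question can be passed instead of an empty string. The following is a minimal sketch of what `build_prompt` returns in that case. It repeats the helper so that it runs on its own without loading the model; the example question is an assumption chosen for illustration and does not come from the model card.

```python
# Standalone copy of the helper from the usage example above.
def build_prompt(prompt="", sep="\n\n### "):
    sys_msg = "以下は、タスクを説明する指示と、文脈のある入力の組み合わせです。要求を適切に満たす応答を書きなさい。"
    p = sys_msg
    roles = ["指示", "応答"]
    user_query = "与えられた画像について、詳細に述べてください。"
    msgs = [": \n" + user_query, ": "]
    if prompt:
        roles.insert(1, "入力")
        msgs.insert(1, ": \n" + prompt)
    for role, msg in zip(roles, msgs):
        p += sep + role + msg
    return p

# Hypothetical question ("Are the cherry blossoms in bloom?"), for illustration only.
question = "桜は咲いていますか？"

caption_prompt = build_prompt()       # captioning: instruction and response sections only
vqa_prompt = build_prompt(question)   # question answering: adds an input section

print(vqa_prompt)
```

With a non-empty prompt, the returned string contains three `### ` sections in the order 指示 (instruction), 入力 (input), 応答 (response), and it ends with the open response header so that the model continues from there. With an empty prompt, the 入力 section is omitted.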
Model Details
- Developed by: Stability AI
- Model type: InstructBLIP
- Language(s): Japanese
- License: JAPANESE STABLELM RESEARCH LICENSE AGREEMENT.
Training
Japanese InstructBLIP Alpha leverages the InstructBLIP architecture. It consists of three components: a frozen vision encoder, a Q-Former, and a frozen LLM. The vision encoder and the Q-Former were initialized from Salesforce/instructblip-vicuna-7b. The frozen LLM is Japanese-StableLM-Instruct-Alpha-7B. During training, only the Q-Former was updated.
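The freezing scheme described above can be illustrated with a toy PyTorch module. This is only a sketch under stated assumptions: `ToyInstructBlip`, its attribute names, and `freeze_all_but_qformer` are hypothetical stand-ins, not the actual training code or the real model's module names, and the source does not say how any layers other than the three named components were handled.

```python
import torch
from torch import nn


class ToyInstructBlip(nn.Module):
    """Hypothetical stand-in with one small layer per component."""

    def __init__(self):
        super().__init__()
        self.vision_encoder = nn.Linear(8, 8)  # stands in for the frozen vision encoder
        self.qformer = nn.Linear(8, 8)         # stands in for the trainable Q-Former
        self.llm = nn.Linear(8, 8)             # stands in for the frozen LLM


def freeze_all_but_qformer(model):
    """Enable gradients only for parameters under `qformer`; return their names."""
    for name, param in model.named_parameters():
        param.requires_grad = name.startswith("qformer.")
    return [name for name, param in model.named_parameters() if param.requires_grad]


toy = ToyInstructBlip()
trainable = freeze_all_but_qformer(toy)
print(trainable)

# Only the trainable parameters need to be handed to the optimizer.
optimizer = torch.optim.AdamW(
    [p for p in toy.parameters() if p.requires_grad], lr=1e-4
)
```

Passing only the parameters with `requires_grad=True` to the optimizer keeps the frozen components out of the update step. The learning rate and the choice of AdamW here are placeholders, not values reported for this model.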
Training Dataset
The training dataset includes the following public datasets:
- CC12M with captions translated into Japanese
- MS-COCO with STAIR Captions
- Japanese Visual Genome VQA dataset
Use and Limitations
Intended Use
This model is intended to be used by the open-source community in chat-like applications in adherence with the research license.
Limitations and bias
Although the aforementioned datasets help to steer the base language models into "safer" distributions of text, not all biases and toxicity can be mitigated through fine-tuning. We ask that users be mindful of such potential issues that can arise in generated responses. Do not treat model outputs as substitutes for human judgment or as sources of truth. Please use responsibly.
How to cite
@misc{JapaneseInstructBLIPAlpha,
url = {https://huggingface.co/stabilityai/japanese-instructblip-alpha},
title = {Japanese InstructBLIP Alpha},
author = {Shing, Makoto and Akiba, Takuya}
}
Citations
@misc{dai2023instructblip,
title = {InstructBLIP: Towards General-purpose Vision-Language Models with Instruction Tuning},
author = {Wenliang Dai and Junnan Li and Dongxu Li and Anthony Meng Huat Tiong and Junqi Zhao and Weisheng Wang and Boyang Li and Pascale Fung and Steven Hoi},
year = {2023},
eprint = {2305.06500},
archivePrefix = {arXiv},
primaryClass = {cs.CV}
}