I am interfacing with a Java application via Python. I need to be able to construct byte sequences which contain UTF-8 strings. Java uses a modified UTF-8 encoding in DataInputStream.readUTF(), which is not supported by Python (yet, at least).
Can anybody point me in the right direction to construct Java modified UTF-8 strings in Python?
Update #2: I am trying to interface with a third-party JBoss web app which uses this modified UTF-8 format to read in strings via POST requests by calling DataInputStream.readUTF (sorry for any confusion regarding normal Java UTF-8 string operation).
Thanks in advance.
On the Python side, you can just handle it like this:
- Convert the string into normal UTF-8 and store the bytes in a buffer.
- Write the 2-byte buffer length (not the string length) as binary in big-endian.
- Write the whole buffer.
MUTF-8 is mainly used for JNI and other systems with null-terminated strings.
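Apart from characters above U+FFFF, which modified UTF-8 writes as a surrogate pair of two 3-byte sequences instead of one 4-byte sequence, the only difference from normal UTF-8 is how U+0000 is encoded.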
For U+0000, normal UTF-8 uses a 1-byte encoding (0x00) and MUTF-8 uses 2 bytes (0xC0 0x80).
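To make the byte-level difference concrete, here is a minimal sketch of an encoder. The name `encode_mutf8` is made up for illustration. It assumes the rules in Java's documentation for modified UTF-8: U+0000 becomes 0xC0 0x80, and characters above U+FFFF are written as a surrogate pair, each half taking three bytes. It produces only the encoded bytes, without the 2-byte length prefix that readUTF expects.

```python
def encode_mutf8(text: str) -> bytes:
    """Encode text as modified UTF-8 bytes (sketch, no length prefix)."""
    out = bytearray()
    for ch in text:
        cp = ord(ch)
        if cp == 0:
            # U+0000 becomes the two-byte form C0 80 instead of a single 00.
            out += b"\xc0\x80"
        elif cp < 0x10000:
            # Everything else in the BMP matches standard UTF-8.
            # surrogatepass lets lone surrogates already in the str through.
            out += ch.encode("utf-8", "surrogatepass")
        else:
            # Supplementary character: split into a UTF-16 surrogate pair
            # and encode each half as its own three-byte sequence.
            cp -= 0x10000
            high = 0xD800 + (cp >> 10)
            low = 0xDC00 + (cp & 0x3FF)
            out += chr(high).encode("utf-8", "surrogatepass")
            out += chr(low).encode("utf-8", "surrogatepass")
    return bytes(out)


print(encode_mutf8("A\x00").hex())  # 41c080
```

For text with no NUL characters and nothing above U+FFFF, the result is identical to `text.encode("utf-8")`.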
First of all, you shouldn't normally have U+0000 (the NUL control character) in any Unicode text.
DataInputStream.readUTF() doesn't enforce the encoding, so it happily accepts either one.
EDIT: The Python code should look like this:
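    import struct

    def writeUTF(data, text):
        # Standard UTF-8 matches modified UTF-8 except for U+0000 and
        # characters above U+FFFF.
        utf8 = text.encode('utf-8')
        # The prefix is the byte length, not the character count.
        length = len(utf8)
        # 2-byte big-endian unsigned length, followed by the raw bytes.
        data.append(struct.pack('!H', length))
        data.append(struct.pack('!' + str(length) + 's', utf8))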
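As a usage illustration, here is a variant sketch that returns the framed bytes directly, so several strings can be concatenated into one request body. `frame_utf` and `MAX_UTF_BYTES` are made-up names, and the field values are placeholders. The explicit size check is there because the length prefix is an unsigned 16-bit value and cannot describe more than 65535 bytes.

```python
import struct

MAX_UTF_BYTES = 0xFFFF  # largest value an unsigned 16-bit prefix can hold


def frame_utf(text: str) -> bytes:
    """Return the 2-byte big-endian length followed by the UTF-8 bytes."""
    payload = text.encode("utf-8")
    if len(payload) > MAX_UTF_BYTES:
        raise ValueError("encoded string exceeds 65535 bytes")
    return struct.pack("!H", len(payload)) + payload


# Two strings back to back, as consecutive readUTF() calls would consume them.
body = frame_utf("user") + frame_utf("pässword")
print(body.hex())
```

Like the function above, this uses plain UTF-8 for the payload, so it only matches Java's format when the text contains no U+0000 and no characters above U+FFFF.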
For DataInput.readUTF, I suspect you'll just have to convert the (well-documented) format into Python. It doesn't look like it would be particularly hard to do.
After reading the length and then the binary data itself, I suggest you use a first pass to work out how many Unicode characters will be in the output, then construct a string accordingly in a second pass.
Without knowing Python I don't know the ins and outs of how to efficiently construct a string, but given the linked specification I can't imagine it would be very hard.
You might want to look at the source for the existing UTF-8 decoder as a starting point.
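For the reading direction, a rough Python sketch might look like the following. `read_utf` is a hypothetical helper, not an existing API. Instead of the two-pass approach, it leans on Python's built-in UTF-8 decoder: the 0xC0 0x80 form is mapped back to a single zero byte first, the `surrogatepass` error handler lets the three-byte surrogate halves through, and a UTF-16 round trip joins each pair into one character. It assumes the input is well-formed modified UTF-8.

```python
import struct


def read_utf(buf: bytes, offset: int = 0):
    """Return (text, next_offset) for one length-prefixed string in buf."""
    # 2-byte big-endian unsigned length of the encoded data that follows.
    (length,) = struct.unpack_from("!H", buf, offset)
    start = offset + 2
    raw = buf[start:start + length]
    if len(raw) != length:
        raise ValueError("buffer ends before the declared length")
    # C0 only occurs as the lead byte of the two-byte NUL form, so a plain
    # byte replacement is enough to restore U+0000.
    raw = raw.replace(b"\xc0\x80", b"\x00")
    # Decode, letting encoded surrogate halves through as lone surrogates.
    text = raw.decode("utf-8", "surrogatepass")
    # Round-trip through UTF-16 so each surrogate pair becomes one character.
    text = text.encode("utf-16", "surrogatepass").decode("utf-16")
    return text, start + length


text, next_offset = read_utf(b"\x00\x03abc")
print(text, next_offset)  # abc 5
```

Returning the next offset makes it straightforward to read several strings from the same buffer one after another.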