yad4u Java modified UTF-8 strings in Python





I am interfacing with a Java application from Python. I need to construct byte sequences that contain UTF-8 strings. Java uses a modified UTF-8 encoding in DataInputStream.readUTF(), which Python does not support (not yet, at least).

Can anybody point me in the right direction for constructing Java modified UTF-8 strings in Python?

Update #1: For more about Java's modified UTF-8, check out the readUTF method of the DataInput interface on line 550 here, or here in the Java SE docs.

Update #2: I am trying to interface with a third-party JBoss web app that uses this modified UTF-8 format to read strings from POST requests by calling DataInputStream.readUTF (sorry for any confusion with normal Java UTF-8 string handling).

Thanks in advance.





1:



For most text you can ignore Modified UTF-8 Encoding (MUTF-8) and just treat it as UTF-8.


On the Python side, you can handle it like this:
  1. Convert the string into normal UTF-8 and store the bytes in a buffer.
  2. Write the buffer length in bytes (not the string length in characters) as a 2-byte big-endian integer.
  3. Write the whole buffer.
I've done this in PHP, and Java didn't complain about my encoding at all (at least in Java 5).
MUTF-8 is mainly used for JNI and other systems with null-terminated strings.


The main difference from normal UTF-8 is how U+0000 is encoded.


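Normal UTF-8 uses a 1-byte encoding for it (0x00) and MUTF-8 uses 2 bytes (0xC0 0x80). Characters above U+FFFF also differ: MUTF-8 encodes them as a surrogate pair, 3 bytes per surrogate, where normal UTF-8 uses a single 4-byte sequence.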

First of all, you should rarely have U+0000 (the NUL character) in ordinary Unicode text.

Secondly, DataInputStream.readUTF() doesn't enforce the encoding of U+0000, so it happily accepts either one.

EDIT: The Python code should look like this:

    import struct

    def writeUTF(data, text):
        # data is a list of byte chunks; text is the string to send.
        utf8 = text.encode('utf-8')
        length = len(utf8)
        # 2-byte big-endian length in bytes, then the raw bytes themselves.
        data.append(struct.pack('!H', length))
        data.append(struct.pack('!%ds' % length, utf8))
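Plain UTF-8 matches what readUTF expects only while the string has no U+0000 and no characters above U+FFFF (readUTF rejects 4-byte sequences). If those can occur, a sketch along the following lines may be safer. It assumes Python 3, and the names encode_modified_utf8 and write_utf are made up for this example, not part of any library.

```python
import struct


def _utf16_units(text):
    """Yield UTF-16 code units, splitting characters above U+FFFF into surrogates."""
    for ch in text:
        cp = ord(ch)
        if cp >= 0x10000:
            cp -= 0x10000
            yield 0xD800 | (cp >> 10)    # high surrogate
            yield 0xDC00 | (cp & 0x3FF)  # low surrogate
        else:
            yield cp


def encode_modified_utf8(text):
    """Encode text following the modified UTF-8 rules documented for DataInput."""
    out = bytearray()
    for unit in _utf16_units(text):
        if 0x01 <= unit <= 0x7F:
            out.append(unit)                       # 1 byte
        elif unit <= 0x7FF:
            # 2 bytes; this branch also covers U+0000, giving 0xC0 0x80
            out.append(0xC0 | (unit >> 6))
            out.append(0x80 | (unit & 0x3F))
        else:
            out.append(0xE0 | (unit >> 12))        # 3 bytes
            out.append(0x80 | ((unit >> 6) & 0x3F))
            out.append(0x80 | (unit & 0x3F))
    return bytes(out)


def write_utf(text):
    """Return the bytes readUTF expects: 2-byte big-endian length, then payload."""
    payload = encode_modified_utf8(text)
    if len(payload) > 0xFFFF:
        raise ValueError("encoded string does not fit a 2-byte length prefix")
    return struct.pack('!H', len(payload)) + payload
```

For example, write_utf("abc") gives the two length bytes 0x00 0x03 followed by the three ASCII bytes. The length check is there because the 2-byte prefix caps the payload at 65535 bytes.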


2:


Okay, if you need to read the format of DataInput.readUTF, I suspect you'll just have to convert the (well-documented) format into Python. It doesn't look like it would be particularly hard to do.

After reading the length and then the binary data itself, I suggest you use a first pass to work out how many Unicode characters will be in the output, then construct a string accordingly in a second pass.

Without knowing Python I don't know the ins and outs of how to efficiently construct a string, but given the linked specification I can't imagine it would be very hard.

You might want to look at the source for the existing UTF-8 decoder as a starting point.
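For the reading direction, a rough Python 3 sketch of that conversion might look like the following. The names decode_modified_utf8 and read_utf are hypothetical, and instead of two passes it collects UTF-16 code units in a list and lets Python's utf-16-be codec join surrogate pairs.

```python
import struct


def decode_modified_utf8(data):
    """Decode a bytes object holding modified UTF-8 into a Python str."""
    units = []
    i, n = 0, len(data)
    while i < n:
        b0 = data[i]
        if b0 < 0x80:                 # 0xxxxxxx: one byte
            size, unit = 1, b0
        elif b0 >> 5 == 0b110:        # 110xxxxx 10xxxxxx: two bytes
            size = 2
            if i + size > n:
                raise ValueError("truncated sequence at byte %d" % i)
            unit = ((b0 & 0x1F) << 6) | (data[i + 1] & 0x3F)
        elif b0 >> 4 == 0b1110:       # 1110xxxx 10xxxxxx 10xxxxxx: three bytes
            size = 3
            if i + size > n:
                raise ValueError("truncated sequence at byte %d" % i)
            unit = (((b0 & 0x0F) << 12)
                    | ((data[i + 1] & 0x3F) << 6)
                    | (data[i + 2] & 0x3F))
        else:                         # 10xxxxxx or 1111xxxx lead byte
            raise ValueError("malformed input at byte %d" % i)
        units.append(unit)
        i += size
    # The units are UTF-16 code units; decoding them as UTF-16 joins surrogate
    # pairs, and 'surrogatepass' lets unpaired surrogates through unchanged.
    raw = struct.pack('>%dH' % len(units), *units)
    return raw.decode('utf-16-be', 'surrogatepass')


def read_utf(buf, offset=0):
    """Read one length-prefixed string; return (text, offset after it)."""
    (length,) = struct.unpack_from('!H', buf, offset)
    start = offset + 2
    payload = buf[start:start + length]
    if len(payload) != length:
        raise ValueError("buffer ends before the declared length")
    return decode_modified_utf8(payload), start + length
```

This sketch does not check that continuation bytes really start with the bits 10, which the Java implementation does, so treat it as a starting point rather than a strict validator.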


3:


Maybe this can help you, although it looks like it's the reverse of what you're doing: Connecting a Java applet to a Python SocketServer.


