An analysis of Python character encoding and binary: how much do you know about these concepts?

Binary

Main idea:

Von Neumann architecture + Turing machine

How can electricity represent state in a stable way?

When computers were first designed, the priority was not simplicity but reliability: the machine had to complete its tasks automatically and produce results that could be trusted.

Simplicity always rests on a stable and reliable foundation.

After the decimal system was tried, it proved difficult to tell the different current levels apart and difficult to hold each level steady. The most stable thing to check is whether current is flowing at all.

That gives two states, energized and not energized. Energized is defined as 1 and not energized as 0, and all state logic is built from 1 and 0.

Bit

So how do 0 and 1 represent numbers and characters?

First work out which characters need to be represented. English letters, digits, and symbols come to just over 100 characters, so 7 binary digits (128 combinations) are enough to represent all of them.

For extensibility, one extra bit was added to allow for expansion. This is the ASCII code.

Because a character needs at most 8 binary bits, 8 bits are specified as one storage unit:

8 Bit = 1 Byte

Characters are assigned numbers by convention, and numbers are expressed in binary: character -> number -> binary.

Text can therefore be stored by the computer as binary, and the binary stored on the computer can be converted back into text.

The conversion between decimal and binary is fixed, so it is the conversion between characters and numbers that is called character encoding. ASCII, Unicode, and UTF-8 all describe a mapping between characters and numbers.
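The character -> number -> binary chain can be sketched in a few lines. This is a minimal illustration that assumes plain ASCII text; the sample string and the helper names text_to_bits and bits_to_text are made up for this example.

```python
def text_to_bits(text):
    """Map each character to its ASCII number, then to 8 binary digits."""
    return " ".join(format(byte, "08b") for byte in text.encode("ascii"))


def bits_to_text(bits):
    """Reverse the mapping: 8 binary digits -> number -> character."""
    return bytes(int(chunk, 2) for chunk in bits.split()).decode("ascii")


bits = text_to_bits("Hi")
print(bits)                 # 01001000 01101001
print(bits_to_text(bits))   # Hi
```

Each 8-digit group is one byte: "H" is number 72 and "i" is number 105.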

A few relationships to keep straight

1. The relationship between characters and numbers is a mapping, defined by a human-made standard

This kind of mapping is common in everyday life, for example

a. ID card information and ID number

b. The database id and the row information

c. Order information and order number

d. Employee ID and employee

e. Dictionary keys and values

f. Memory address and the value stored at that address

...

2. The relationship between numbers and binary is like a law of mathematics or physics: the conversion method is fixed (hard-coded)

3. Octal and hexadecimal are built on binary and have no direct relationship with decimal. They exist mainly for readability, as two compact ways of writing binary.

For example, take the binary storage unit 00000000. Octal groups the bits in threes, padding on the left as needed (000 000 000), and writes each group as one digit. A group of three bits ranges from 0 to 7, so the octal digits are 0-7.

Hexadecimal groups the bits in fours (0000 0000) and writes each group as one digit. A group of four bits ranges from 0 to 15. Because that goes beyond what the ten decimal digits can represent, the letters a b c d e f stand for 10 11 12 13 14 15.

Hexadecimal is commonly used for memory addresses, IPv6 addresses, color tables, MAC addresses, and binary data (the \x escapes in bytes literals prefixed with b/B).

An IPv4 address (32 bits, dotted decimal) has the form x.x.x.x, where each x is a decimal number represented by 8 bits.
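The grouping of bits into octal and hexadecimal digits can be checked directly in Python. A small sketch, assuming an arbitrary 8-bit sample value; format() with the "b", "o", and "x" codes is standard Python, and the variable names are illustrative.

```python
value = 0b10101111            # arbitrary 8-bit sample value (175 in decimal)

bits = format(value, "08b")   # '10101111'

# Hexadecimal: split into groups of 4 bits, one hex digit per group.
hex_groups = [bits[i:i + 4] for i in range(0, 8, 4)]
hex_digits = "".join(format(int(group, 2), "x") for group in hex_groups)

# Octal: pad to 9 bits, split into groups of 3 bits, one octal digit per group.
padded = bits.zfill(9)
oct_groups = [padded[i:i + 3] for i in range(0, 9, 3)]
oct_digits = "".join(format(int(group, 2), "o") for group in oct_groups)

print(hex_groups, hex_digits)   # ['1010', '1111'] af
print(oct_groups, oct_digits)   # ['010', '101', '111'] 257
```

Both results agree with the built-in conversions: hex(175) is "0xaf" and oct(175) is "0o257".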

Python base conversion functions

Decimal to other bases

To binary: bin(), prefix 0b

To hexadecimal: hex(), prefix 0x

To octal: oct(), prefix 0o

The binary, octal, and hexadecimal results are strings carrying the corresponding prefix: "0b...", "0o...", "0x..."

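# A decimal (base 10) number
number = 9999
print("Base 10 to other bases ".ljust(40, "*"))
# Base 10 to base 2
b_number = bin(number)
print("binary:", b_number)
# Base 10 to base 8
o_number = oct(number)
print("octal:", o_number)
# Base 10 to base 16
h_number = hex(number)
print("hexadecimal:", h_number)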
 

Other bases to base 10: int(..., base), where base specifies the base of the input string

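# A decimal (base 10) number
number = 9999
print("Base 10 to other bases ".ljust(40, "*"))
# Base 10 to base 2
b_number = bin(number)
print("binary:", b_number)
# Base 10 to base 8
o_number = oct(number)
print("octal:", o_number)
# Base 10 to base 16
h_number = hex(number)
print("hexadecimal:", h_number)

# Other bases back to base 10 (int() accepts the 0b/0o/0x prefix)
# Base 2 to base 10
num_b = int(b_number, base=2)
print(num_b)
# Base 8 to base 10
num_o = int(o_number, base=8)
print(num_o)
# Base 16 to base 10
num_h = int(h_number, base=16)
print(num_h)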
 

String to bytes

bytes()

encode()

bytes() must be given the character encoding, while str.encode() defaults to UTF-8. The result is a bytes object, displayed with the prefix b"..."

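# The string to encode
song = " "
# str -> bytes with str.encode()
byte_song = song.encode(encoding="utf-8")
print(byte_song)
# Equivalent: the bytes() constructor with an explicit encoding
eq_byte_song = bytes(song, encoding="utf-8")
print(eq_byte_song)
print(byte_song == eq_byte_song)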
 

Bytes to string

decode()

str()

str() must be given the character encoding, while bytes.decode() defaults to UTF-8

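# The string to encode
song = " "
# str -> bytes
byte_song = song.encode(encoding="utf-8")
print(byte_song)

# bytes -> str with bytes.decode()
print(" decode".rjust(40, "_"))
dec_song = byte_song.decode(encoding="utf-8")
print(dec_song)
# Equivalent: the str() constructor with an explicit encoding
str_song = str(byte_song, encoding="utf-8")
print(str_song)
print(dec_song == str_song)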
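The round trip is easier to see with text that is not pure ASCII. A minimal sketch, assuming a made-up sample string and the UTF-8 and GBK codecs that ship with CPython:

```python
text = "你好, Python"   # hypothetical sample text: 2 Chinese characters + 8 ASCII characters

utf8_bytes = text.encode("utf-8")
gbk_bytes = text.encode("gbk")

print(utf8_bytes)
# Number of characters, then number of bytes in each encoding.
print(len(text), len(utf8_bytes), len(gbk_bytes))   # 10 14 12

# Decoding with the same encoding that produced the bytes restores the text.
from_utf8 = utf8_bytes.decode("utf-8")
from_gbk = str(gbk_bytes, encoding="gbk")
print(from_utf8 == text, from_gbk == text)          # True True
```

The same text yields different byte sequences under different encodings, which is why the decoder has to use the encoding that produced the bytes.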
 

Arithmetic method

Decimal to base 2, 8, or 16: divide repeatedly by the base and collect the remainders, then read them in reverse order

Other bases to decimal: going from right to left, multiply each digit by the base raised to the power of its position (starting at 0), then sum the results

The conversion method is fixed, like a formula
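The two arithmetic methods can be written out by hand and compared with the built-ins. This is a sketch for non-negative integers and bases up to 16; the names DIGITS, to_base, and from_base are invented for this example.

```python
DIGITS = "0123456789abcdef"


def to_base(number, base):
    """Divide repeatedly by the base and collect the remainders in reverse."""
    if number == 0:
        return "0"
    digits = []
    while number > 0:
        number, remainder = divmod(number, base)
        digits.append(DIGITS[remainder])
    return "".join(reversed(digits))


def from_base(text, base):
    """Sum digit * base**position, with positions counted from the right."""
    total = 0
    for position, char in enumerate(reversed(text)):
        total += DIGITS.index(char) * base ** position
    return total


print(to_base(9999, 2))        # 10011100001111
print(to_base(9999, 8))        # 23417
print(to_base(9999, 16))       # 270f
print(from_base("270f", 16))   # 9999
```

The results match bin(9999), oct(9999), and hex(9999) once the 0b/0o/0x prefixes are removed.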

Binary representation

Integers are divided into signed and unsigned types, and a number (integer or floating point) is generally stored in 8, 16, 32, or 64 bits

For a signed type, the highest (leftmost) bit is the sign bit: 0 means positive and 1 means negative

The signed range is split into two halves, one for negative numbers and one for non-negative numbers.

Put simply, one bit is taken away for the sign, so the range is -2**(n-1) to 2**(n-1) - 1, where n = 8/16/32/64

The unsigned range is 0 to 2**n - 1

The lengths correspond to 1/2/4/8 bytes
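The ranges can be computed directly from the formulas. A short sketch; the helper names are made up, and the byte-level part assumes two's complement, which is what int.to_bytes() and int.from_bytes() use when signed=True.

```python
def signed_range(bits):
    """Smallest and largest value of a signed integer with the given width."""
    return -2 ** (bits - 1), 2 ** (bits - 1) - 1


def unsigned_range(bits):
    """Smallest and largest value of an unsigned integer with the given width."""
    return 0, 2 ** bits - 1


for bits in (8, 16, 32, 64):
    print(bits, signed_range(bits), unsigned_range(bits))

# The same 8 bits read as signed and as unsigned.
raw = (-1).to_bytes(1, byteorder="big", signed=True)
as_signed = int.from_bytes(raw, byteorder="big", signed=True)
as_unsigned = int.from_bytes(raw, byteorder="big", signed=False)
print(raw, as_signed, as_unsigned)   # b'\xff' -1 255
```

For 8 bits this gives -128 to 127 signed and 0 to 255 unsigned.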

Python function that returns the number (ASCII / Unicode code point) of a character

ord()
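A quick look at ord() together with its inverse, the built-in chr(). The sample characters are arbitrary.

```python
code_a = ord("A")          # character -> number
char_a = chr(65)           # number -> character
code_zh = ord("中")        # also works beyond ASCII: the Unicode code point

print(code_a)              # 65
print(char_a)              # A
print(code_zh)             # 20013
print(hex(code_zh))        # 0x4e2d

# Character, number, and binary form side by side.
for ch in "Py3":
    print(ch, ord(ch), format(ord(ch), "08b"))
```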

Character Encoding

Language -> number -> binary (0 and 1)

This mapping table is called a character encoding

The problem a character encoding solves is the mapping between characters and numbers, which is defined by people

Chinese: GB2312 -> GBK (a Chinese character takes 2 bytes, an English character 1 byte)

International: Unicode (2 or 4 bytes per character when stored as UTF-16 or UTF-32) -> UTF-8 (1-4 bytes)

1. Supports the characters of languages worldwide

2. Contains mappings for the world's character encodings

Text in each country's own encoding can be converted to Unicode, and Unicode can be converted back to each of those encodings

3. Software and hardware worldwide support Unicode

UTF-8 is the mainstream

A fixed-width Unicode representation needs at least 2 bytes per character, whereas the original ASCII needs only one.

Stored that way, ASCII text would immediately need twice the space for storage and network transmission, which is unacceptable

UTF-8 stepped onto the stage of history to solve this problem. Network transmission and storage use UTF-8 while the operating system supports Unicode, so efficient transmission, efficient storage, and support for the world's languages all become possible together
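The storage cost can be measured per character. A small sketch with arbitrarily chosen sample characters, using the utf-8, utf-16-le, and utf-32-le codecs from the standard library (the -le variants avoid counting a byte order mark):

```python
ENCODINGS = ("utf-8", "utf-16-le", "utf-32-le")
samples = ["A", "é", "中", "\U0001F600"]   # ASCII, Latin, CJK, emoji


def byte_sizes(char):
    """Number of bytes one character takes in each encoding."""
    return tuple(len(char.encode(name)) for name in ENCODINGS)


for char in samples:
    print(repr(char), byte_sizes(char))

# ASCII-only text: UTF-8 keeps one byte per character, UTF-16 doubles it.
ascii_text = "hello world"
print(len(ascii_text.encode("utf-8")), len(ascii_text.encode("utf-16-le")))   # 11 22
```

UTF-8 grows from 1 to 4 bytes as the code point gets larger, so plain ASCII text costs nothing extra.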

Encoding in Python

First of all, what exactly does encoding mean in Python?

Consider two things: the file that stores the code, and what happens when that file is loaded into memory and processed by the interpreter

The code we type is, in essence, text data

Text data must be converted into binary through some encoding table before it is stored on the hard disk

Binary data stored on the computer likewise needs an encoding table to be converted back into text data

What is the encoding in Python?

In Python 3 the default source file encoding is UTF-8, and the editor we use to edit files has a default encoding too, generally also UTF-8. If the text in a file is not encoded in UTF-8, a header line in the file must tell the Python interpreter which encoding the file uses.

What the interpreter reads is not the text we see in the editor but the 0s and 1s stored on the hard disk

Given that binary data, the interpreter tries to decode it with the default UTF-8 encoding. If the file was not actually saved as UTF-8, the result is garbled characters and the interpreter fails to parse the text.

That is why the encoding of the file must be declared at the beginning of the Python source file: it tells the interpreter how to decode the file

Inside the interpreter, text is held as Unicode. The interpreter decodes the binary data it reads using the file's character encoding and keeps the resulting text as Unicode. As long as the operating system supports Unicode, the interpreter can execute normally and output the result

Interpreter

Binary data -> look up the character encoding table -> text data -> text data as Unicode

Editor

Binary data -> look up the character encoding table -> text data in that encoding

Both the interpreter and the editor start from the binary data of the file and convert it into text through an encoding, but the interpreter goes on to compile that text into lower-level instructions (bytecode) and execute them

What needs to be clear is that the encoding of a Python source file and the interpreter's internal text representation are different things

Python source files default to UTF-8, while the interpreter works with Unicode strings in memory
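The path from bytes on disk to text in memory can be reproduced with a temporary file. A minimal sketch; the file name demo.py and its one-line content are hypothetical, and the file is written as raw bytes so that no newline translation gets in the way.

```python
import sys
import tempfile
from pathlib import Path

source_text = 'print("你好")\n'   # hypothetical one-line source file

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "demo.py"
    # What the editor does on save: text -> bytes through an encoding table.
    path.write_bytes(source_text.encode("utf-8"))
    # What is really on disk: bytes, not characters.
    raw = path.read_bytes()

# What the interpreter does on load: bytes -> text through the same table.
decoded = raw.decode("utf-8")

print(raw)
print(decoded == source_text)      # True
print(sys.getdefaultencoding())    # utf-8
```

The two Chinese characters appear in raw as six \x.. bytes, and they only become text again once the bytes are decoded as UTF-8.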

With that understood, garbled text becomes straightforward to troubleshoot

Garbled text: the character encoding was specified incorrectly, that is, the wrong character set was chosen when converting the stored binary back into text

1. For client/server (C/S) software, check whether the client and the server use the same default encoding

2. For a web back end, check whether the default encoding of the database, the encoding of the table, and the encoding used by each language's database connection interface are consistent

3. For files, check whether the default encoding of the editor matches the encoding the file was originally saved with: read with whatever encoding was used to store
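Both failure modes of a mismatched encoding can be reproduced in a few lines. A sketch with a made-up sample string; Latin-1 is chosen as the wrong table because it maps every byte to some character and therefore never raises.

```python
original = "中文"                         # hypothetical sample text

# Case 1: the wrong table decodes silently and produces garbled text.
stored = original.encode("utf-8")         # the bytes that were saved
garbled = stored.decode("latin-1")        # read back with the wrong table
print(garbled == original)                # False

# The bytes themselves are intact, so reversing the wrong step repairs the text.
repaired = garbled.encode("latin-1").decode("utf-8")
print(repaired == original)               # True

# Case 2: the wrong table cannot decode the bytes at all.
gbk_stored = original.encode("gbk")
try:
    gbk_stored.decode("utf-8")
    failed = False
except UnicodeDecodeError:
    failed = True
print(failed)                             # True
```

In both cases the fix is the one stated above: decode with the encoding that was used to store the data.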

How Python declares the character encoding of a source file

1. # coding: utf-8

2. # -*- coding: utf-8 -*-

Both forms start with # and must be written on the first or second line of the source file

# -*- coding:utf-8 -*-
# coding: utf-8
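The standard library can report which encoding a source file declares. A sketch using tokenize.detect_encoding(), with made-up source snippets passed as bytes; the helper name declared_encoding is invented for this example.

```python
import io
import tokenize


def declared_encoding(source):
    """Return the encoding that Python's tokenizer detects for source bytes."""
    encoding, _ = tokenize.detect_encoding(io.BytesIO(source).readline)
    return encoding


print(declared_encoding(b"# -*- coding: gbk -*-\nx = 1\n"))                 # gbk
print(declared_encoding(b"# coding: utf-8\nx = 1\n"))                       # utf-8
print(declared_encoding(b"#!/usr/bin/env python\n# coding: gbk\nx = 1\n"))  # gbk
print(declared_encoding(b"x = 1\n"))                                        # utf-8
```

The third snippet shows the declaration on the second line, below a shebang line, and the last one shows the UTF-8 default when nothing is declared.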
 



 Original address: www.tuicool.com/articles/Rz...